Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/openvenues/libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
https://github.com/openvenues/libpostal

[fix] language disambiguation

7053c6b60b088d28adfb0c906f0b75b741a51ceb authored over 9 years ago by Al <[email protected]>
[dictionaries] Occitan stopwords for disambiguating from French

e26776a5e9227ad2c2d9f1f622210e1fc2d244a0 authored over 9 years ago by Al <[email protected]>
[languages] If a non-Latin script in a string would prohibit the found language, return ambiguous. Adding some test cases for sanity checking the labeling

f6d84531bc7395d2d23ed96d0effcbd8213dc9aa authored over 9 years ago by Al <[email protected]>
[mv] Moving the get regional/country languages logic out of language polygons

b8e4c191468f83ac669802a6a34d87767a67273e authored over 9 years ago by Al <[email protected]>
[languages] Using stopwords only to account for how ambiguous a phrase is, not for disambiguation

43178747f8251d1b2ad225670ee07ae2d3ec603d authored over 9 years ago by Al <[email protected]>
[languages] Adding non-canonicals only for streets, prefixes and suffixes. Better handling of default languages, abbreviations and ambiguity

d8763e9d6c26e868660e22f43f6b0478bbf8d153 authored over 9 years ago by Al <[email protected]>
[dictionaries] Norwegian street types from the suffix dictionary

9c176961ffb5594db70243347baacdd4f161c753 authored over 9 years ago by Al <[email protected]>
[languages] non-default languages can still be labeled from > 1 char abbreviations if there's no evidence of other languages in the string. Adding Python version of get_string_script from the C lib

122a81b61085b100f01e794d170bd3a0408d9e91 authored over 9 years ago by Al <[email protected]>
[languages] Adding canonical back in to language disambiguation (for prefixes/suffixes too), using non-canonicals/abbreviations in non-default languages if there are no other abbreviations found, adding in stopwords dictionaries

a419dad63079eab977a4338b3aa6ec933d6349d6 authored over 9 years ago by Al <[email protected]>
[fix] No longer using abbreviations for default languages, can be stopwords, etc.

a7d9cc17824142f014e6b16b148acfc4a7f0614a authored over 9 years ago by Al <[email protected]>
[fix] import

0701bb6f086c29eee39d3aa889a9f2d42e5dfd59 authored over 9 years ago by Al <[email protected]>
[languages] Disambiguation uses language defaults, unicode normalized canonicals are treated as canonicals

723058886a776d01053eba1d520e7cd499f62b77 authored over 9 years ago by Al <[email protected]>
[languages] Disambiguation in language labeling better handles default languages and only uses canonical forms for non-default languages

6231e17f2b05871a1fc8854435831f6c313df098 authored over 9 years ago by Al <[email protected]>
[polygons] Adding a main to generate language polygons

bf829f7cb6f6ae5adad7a1b981b86a36244c9751 authored over 9 years ago by Al <[email protected]>
[languages] Adding non-default Spanish and French gazetteers to the US, and giving the country of Jersey shared English/French defaults instead of just English

5c15c4a99f7b3995c8771b52567fe3b50214d472 authored over 9 years ago by Al <[email protected]>
[fix] import

e70c2453ee5e0491e288de3acd3952f620d34aba authored over 9 years ago by Al <[email protected]>
[osm] Some countries like Lebanon in OSM will list the same address under two languages (French/English), which creates an unreasonable task for a linear classifier, so running disambiguation in those cases

390271525896a658a951dd233592518d978d4dd8 authored over 9 years ago by Al <[email protected]>
[geonames] Adding covering index to geonames DB

f6e521e3f34a0319c13edac97de1a38f31fbe7ef authored over 9 years ago by Al <[email protected]>
[mv] csv_utils

bd31dc99f28a933ec4e4ad95480f1db23e3cbe25 authored over 9 years ago by Al <[email protected]>
[languages] Adding English gazetteers to many countries where the default language is Arabic but the road signs may be in English

cc43409b726337ba0acc467e891cfc12cb8b7198 authored over 9 years ago by Al <[email protected]>
[languages] Refactoring street_types_gazetteer a bit so dictionaries are configurable

c5a9c392d4d2bd4fb8573dcbbff3229c6f4c853f authored over 9 years ago by Al <[email protected]>
[fix] language disambiguation module

baa60aab65a638dc485c0f1dfdc1ebeab513bb98 authored over 9 years ago by Al <[email protected]>
[fix] var name

4976be64e585d96c2d9e794917d75f7880e35c90 authored over 9 years ago by Al <[email protected]>
[fix] typo

8e56568cabe940492c9718d536406658531f0854 authored over 9 years ago by Al <[email protected]>
[languages] Moving language id methods into a separate package

ca6d802a430e567d362051a562037ec9e75b909c authored over 9 years ago by Al <[email protected]>
[fix] var name

9d2f7e4bd1cdebf31a9380f488909b39956d6c6e authored over 9 years ago by Al <[email protected]>
[osm] OSM untagged formatted addresses try to use language namespaced tags

0528d1b578e7dfcd84c0069832a432162fb0d2eb authored over 9 years ago by Al <[email protected]>
[fix] via in English is a stopword, not a street type

330002197a2a320fc54521b4c17053f91764bf8d authored over 9 years ago by Al <[email protected]>
[osm] OSM untagged formatted addresses now use the new language labeling scheme

c09cb4dd82c7cd63749efdc88c401aba18078593 authored over 9 years ago by Al <[email protected]>
[fix] removing debug print

3daba2ddcd0f024c536adb757f66cbc7d6131287 authored over 9 years ago by Al <[email protected]>
[dictionaries] Updates to Galician and Catalan where they overlap with Spanish

089a197155146202e70b61dfee1fc9df68b5e065 authored over 9 years ago by Al <[email protected]>
[fix] English dictionaries

faf3435ffc9efc77f5f8572df262c4c675d26797 authored over 9 years ago by Al <[email protected]>
[dictionaries] Accented Gran Via for Catalan

9183ba4e0131adb54b1645b874f12c4311549dce authored over 9 years ago by Al <[email protected]>
[dictionaries] A few more Catalan terms that are the same as in Spanish

07b43e524e5db9a36e1bf71cb0be34e9609ef595 authored over 9 years ago by Al <[email protected]>
[languages/osm] Checking for existence of separable prefix/suffix in the given dictionaries

ffe76f04032590053be3b664acd4ec555625bcc8 authored over 9 years ago by Al <[email protected]>
[fix] English dictionary

3b55b51ef12fc750446eb0ef076d4b4ce1d841d8 authored over 9 years ago by Al <[email protected]>
[languages/osm] Adding a primitive phrase dictionary to the OSM training data construction script and a few heuristics to help disambiguate in the case of small local language groups that may not be specified with name:lang tags e.g. Occitan, Catalan, Basque, Galician, etc. Also throwing away ambiguous multilanguage names

0e00625dbd155556b14f1a164358f204a3f43200 authored over 9 years ago by Al <[email protected]>
[dictionaries] Moving a few terms in German dictionaries

fb7f2999e583912732b7acbae10b56a1f7736c84 authored over 9 years ago by Al <[email protected]>
[dictionaries] A few new terms in Dutch dictionaries to help distinguish from German

c5d14e9c4d8d4da73ce28eb31823c8782048cfba authored over 9 years ago by Al <[email protected]>
[dictionaries] Better categorization of French dictionaries

4d115fdd88e29df94ca3712e142f67541ed20357 authored over 9 years ago by Al <[email protected]>
[dictionaries] A few English dictionary terms that came up in language detection tests

0f883a887285ea658306c14ca595612558640cd7 authored over 9 years ago by Al <[email protected]>
[dictionaries] Updating Catalan dictionaries with place types to help distinguish from Spanish

db7ffa7cab9db6e8f5e016a04c26658a729b8b4c authored over 9 years ago by Al <[email protected]>
[dictionaries] Fixes to Spanish dictionaries

a1d8d3bf5fe8ea688fffc224404cb3ab9ea788f9 authored over 9 years ago by Al <[email protected]>
[fix] items

b72d9af7dcc6aa9b783047055246ac91d61afc68 authored over 9 years ago by Al <[email protected]>
[fix] getter

f3bb3c83569d18d38a0ce1f1bfa7013fe29ba3f4 authored over 9 years ago by Al <[email protected]>
[fix] name

ebd5e96bd73d7a31e40df02099cdebe7bedbc106 authored over 9 years ago by Al <[email protected]>
[fix] var name

b5be1e8df5c30126b6a756949d381f9f353df819 authored over 9 years ago by Al <[email protected]>
[fix] language polys

e84f932042e23b8f9f943404dbcd177d456c8318 authored over 9 years ago by Al <[email protected]>
[polygons] Changes to languages polygons to support new regional language handling

bada7fd13b48c1700b6d06cf13255dce360a20c4 authored over 9 years ago by Al <[email protected]>
[languages] Allowing specification of multiple regional languages

d97c725bbcab9ffd8f7802368d3696ea75ef4c1d authored over 9 years ago by Al <[email protected]>
[languages] Removing the Belarusian override as Russian appears to be used often in street signs and there are generally good name:ru/name:be tags

b8fbbb1917f76c374bcb25af296de0d161692684 authored over 9 years ago by Al <[email protected]>
[dictionaries] Adding French as an equally likely language for Guernsey, which will effectively exclude it from the language training data (doesn't matter since there are already enough English/French addresses).

453aa7c633cd7cf94bfefeb5f6fd15d82aec12fe authored over 9 years ago by Al <[email protected]>
[osm] Omitting country in limited address data set (often abbreviated, doesn't convey language as well)

89071ea21a14146a52f4cdf64240323383f47264 authored over 9 years ago by Al <[email protected]>
[fix] var name

c50526091253f8cbd02ec1b993788fa762b571b7 authored over 9 years ago by Al <[email protected]>
[fix] street addresses by language

548ce79b99bd5064360336fd896b6dc5b9875758 authored over 9 years ago by Al <[email protected]>
[osm] Adding a new OSM training data option for writing out full formatted addresses without place names

74a751ce0afbad51ef5b257e1ff18d051e8f62d3 authored over 9 years ago by Al <[email protected]>
[languages] Bonaire admin1 as well as country code

133ce9e5b19b2a4961c02cbbadf5a023442ca8e8 authored over 9 years ago by Al <[email protected]>
[fix] language polygon index

05b8f555d53515cb6f579072e6679c641928fd31 authored over 9 years ago by Al <[email protected]>
[osm] Adding building tag to venues training set construction

0e92abd53e675a94f7549bcc044e8ad6e2ace686 authored over 9 years ago by Al <[email protected]>
[languages] Changing Bonaire's default road sign language to Papiamento to help distinguish from Dutch

191c0e3ce5d6f0a62c95005300915d9654ff02b7 authored over 9 years ago by Al <[email protected]>
[osm] Making minimal_only the default in formatted addresses, expanding list of acceptable combinations of address fields

cad1f95bbb0a9a118ead3c09c57a761633f56d4d authored over 9 years ago by Al <[email protected]>
[fix] road+house_number as minimal keys for formatting addresses

1e936ac9dc8f8259efbc4a6ed8bf257030dc26f1 authored over 9 years ago by Al <[email protected]>
[fix] param

83bbd67c9c6f54328d30b7aa186956d54c3046be authored over 9 years ago by Al <[email protected]>
[fix] splitter

e993ddcb51a4169ab9063cfa88513efb3c680dfb authored over 9 years ago by Al <[email protected]>
[fix] __init__

dc2766ae5d9acbce68bd857df0d59c19f4b68944 authored over 9 years ago by Al <[email protected]>
[osm] Using pipe splitter for address components

62c67aa970e0cc92d8e94950c6c2dcf0a9de5a1a authored over 9 years ago by Al <[email protected]>
[osm] Prefer amenity tag, skip if the building tag is simply building=yes

2bd763be035884dc658f782711a3eb11998669e9 authored over 9 years ago by Al <[email protected]>
[fix] carriage returns

c844d0484a296325d5cef5e70efd95a57836b82f authored over 9 years ago by Al <[email protected]>
[osm] Replacing escape chars at write time as there's no quoting, adding building key to venue training data

ef14aa2b7ed1106e979a3cd1bab456ca180ce62a authored over 9 years ago by Al <[email protected]>
[polygons] Separating out simplify polygon into a method in RTree index

9125f07af08d20daf813d86cf83c6fcbc17ab435 authored over 9 years ago by Al <[email protected]>
[osm] Using tsv_no_quote writers in all OSM training data files

46f2c68a690c549de7ee0af9a78e6389ebf3ae61 authored over 9 years ago by Al <[email protected]>
[scripts] Regenerating unicode_scripts_data file

9464670174bf1c13d2671ec7236ada60e02b004f authored over 9 years ago by Al <[email protected]>
[utils] no-quote CSV dialect

88d63c85d246379281baca0ab05bd22bc5637434 authored over 9 years ago by Al <[email protected]>
[scripts] Better script code aliasing

03febc7e209420e9e7cb2829ff016b2d03029204 authored over 9 years ago by Al <[email protected]>
[mv] csv_utils

b54ff95ecc4e527e961a223f557ea240fbed1de1 authored over 9 years ago by Al <[email protected]>
[normalize] Need to do a Latin-ASCII transliteration even if the string is entirely ASCII since it may contain HTML escapes

66a71ab70d64e33943abb64ce2bd671f69520cf6 authored over 9 years ago by Al <[email protected]>
[transliteration] Regenerating transliteration data file

87b275fcab9b4b2d9dbe3284c45d071fbf146c05 authored over 9 years ago by Al <[email protected]>
[transliteration] Doing HTML escapes first in Latin-ASCII transliteration as they may need to be resolved further in subsequent steps

cf706158508bff5c155fab471858589921c96269 authored over 9 years ago by Al <[email protected]>
[fix] phrase start in transliteration

9712e0fa8761c901dba89079048d4112ac858cd1 authored over 9 years ago by Al <[email protected]>
[phrases] Fixing tail searches in trie_get_prefix*

562a7c243da6940f8c77807de91ba27e616637eb authored over 9 years ago by Al <[email protected]>
[fix] check for local CLDR in unicode properties

51addec5f2bf54549af70527450037296418faaf authored over 9 years ago by Al <[email protected]>
[fix] ensure CLDR dir

882e4c2ab85a9a3917b3cfd16990e8bcb1120d97 authored over 9 years ago by Al <[email protected]>
[fix] cldr languages dir

48566bf0976f351498f44e1ba8d63c65874971e3 authored over 9 years ago by Al <[email protected]>
[build] Order-only dependencies for downloading data files, rm-ing the tarball when done extracting

e98a82266117abbb2b40e93a85697906c4500d53 authored over 9 years ago by Al <[email protected]>
[build] Fixing tarball uploading

0028c2bc53674e2496b84e7ea1b0e3ce4192d501 authored over 9 years ago by Al <[email protected]>
[build] Adding tarball back to pkgdata

f21b767696b3132842c9b203da0e8efde32e9303 authored over 9 years ago by Al <[email protected]>
[api] Better handling of strings with multiple scripts and strings that use more than one transliterator. Reducing complexity/allocations

c29cf5ac9a0b08bfce6e1ac5a54f2d3ad1e9e734 authored over 9 years ago by Al <[email protected]>
[normalize] Adding the original script as an alternative in transliteration mode as well

4bc6adf6699f3a3a991abd6d585d818458f01d40 authored over 9 years ago by Al <[email protected]>
[utils] string_tree_num_strings method

a13e5117b503983f2c897366272b5d8ab5dae0d7 authored over 9 years ago by Al <[email protected]>
[cli] delete_word_hyphens as a default option

219947722d026e9c7dd8d34e3d2d6c2ce957c3b1 authored over 9 years ago by Al <[email protected]>
[api] Add separable or inseparable non-canonical string affixes (e.g. foobg. => fooburg, foostrasse => foostraße|foo straße, l'ensemble => l' ensemble, etc.) in expand_address

78a80dd86e707a0d584b367cfc92edb67ae5a906 authored over 9 years ago by Al <[email protected]>
[expansion] Adding search_address_dictionaries_prefix/suffix for concatenated prefixes/suffixes e.g. in Germanic languages. Adding a flag to the address_expansion struct and trie value to denote separability, adding prefix/suffix keys during dictionary creation

de5d6945b553cdf1f9b2147d26d1c0c627e1ad07 authored over 9 years ago by Al <[email protected]>
[normalize] Adding a char_array version of normalize token

0f77ca1213571e136c99bbf754572dd645f04d43 authored over 9 years ago by Al <[email protected]>
[utils] char_array_append_reversed for adding reversed strings without a malloc

064b6b5898d3b62b202cb7bff9c2c22b39b5b86f authored over 9 years ago by Al <[email protected]>
[fix] Only the exact TRIE_PREFIX_CHAR/TRIE_SUFFIX_CHAR characters are disallowed as keys

dab181a4d7c69e80ee28653f29ef999bec54c75b authored over 9 years ago by Al <[email protected]>
[phrases] Prefix/suffix trie search using the new characters, fixing length of matched prefixes/suffixes and exiting early on falling off the trie

e511eede74db8caffbd95299fe0c7aefae518d0d authored over 9 years ago by Al <[email protected]>
[phrases] Changing prefix/suffix chars so both are control characters and neither is the NUL-byte. Modifying transliteration special characters accordingly

51572d65757efce01de5a8700cf3c23dc65ba9da authored over 9 years ago by Al <[email protected]>
[phrases] adding _from_index_get_prefix_char/_from_index_get_suffix_char methods

11a9881988c62ec7fb9ba379861111bcc62e4908 authored over 9 years ago by Al <[email protected]>
[phrases] trie_search_prefixes/trie_search_suffixes now take a length param

2eb67ad8501f8f66e01fe48109e2b3234af43fcf authored over 9 years ago by Al <[email protected]>
[fix] NUMEX_STOPWORD_RULE define

bbaa302e2e65bc6efdd8cf4c1c7cb48368ed8d74 authored over 9 years ago by Al <[email protected]>
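
Several of the [osm] and [utils] commits above mention a no-quote CSV dialect, tsv_no_quote writers, and replacing escape characters at write time because there is no quoting. The following is a minimal sketch of that idea using Python's standard csv module. It is not the code from the repository: the function names sanitize and write_tsv_no_quote are hypothetical, and replacing tabs and line breaks with a space is an assumption about what "replacing escape chars" could look like.

```python
import csv
import io


def sanitize(field):
    """Replace characters that would need quoting or escaping in a TSV row.

    Hypothetical helper: with quoting disabled and no escape character,
    a tab or line break inside a field cannot be represented, so each is
    replaced with a single space before writing.
    """
    for ch in ("\t", "\r", "\n"):
        field = field.replace(ch, " ")
    return field


def write_tsv_no_quote(rows, out):
    """Write rows as tab-separated values with quoting turned off."""
    writer = csv.writer(
        out,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # never wrap fields in quote characters
        quotechar=None,          # double quotes in fields are written as-is
        lineterminator="\n",
    )
    for row in rows:
        writer.writerow([sanitize(f) for f in row])


# Usage: a field containing double quotes and an embedded tab.
buf = io.StringIO()
write_tsv_no_quote([["fr", "fr", 'Rue de la "Paix"\t5']], buf)
result = buf.getvalue()
```

Without the sanitize step, the csv module raises csv.Error for a field that contains the delimiter, since no escape character is set. Cleaning fields at write time keeps each record on one line with a fixed number of columns.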