github.com/openvenues/libpostal commits | Ecosyste.ms: OpenCollective

Updated street types for EN

fd2b864b7d2d8bc126d8d2ab0cbab0e3efa459a6 authored almost 7 years ago by Yaroslav Veremenko <[email protected]>

[dedupe/test] adding remaining options to near_dupe_test

6575bdc33930419c2de36896e330fefa7a5df3de authored almost 7 years ago by Al <[email protected]>

[dedupe] for near-dupe hashing, remove whitespace from root expansions so something like "Ocean Walk Dr" and "Oceanwalk Dr" will have a chance of matching downstream

835de327c3145516df50638c4c6ec365dea434e8 authored almost 7 years ago by Al <[email protected]>
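
A minimal sketch of the whitespace-removal idea in the commit above, written in Python rather than libpostal's C. The function name `whitespace_free_key` is hypothetical, and lowercasing stands in for libpostal's own normalization, which is not reproduced here.

```python
def whitespace_free_key(expansion: str) -> str:
    """Lowercase the string and drop all whitespace so spacing variants collide."""
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(expansion.lower().split())


if __name__ == "__main__":
    # Both spellings reduce to the same key, so they can land in the same hash bucket.
    print(whitespace_free_key("Ocean Walk Dr"))
    print(whitespace_free_key("Oceanwalk Dr"))
```

Collapsing whitespace only gives the two strings a chance to be compared downstream; it does not by itself decide that they are duplicates.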

[numex] helper function to retrieve ordinal suffix lengths from a tokenized string for use in deduping

283be99b44eb1c667b42f495a57562d1f1d8b0ff authored almost 7 years ago by Al <[email protected]>

[dedupe] account for missing ordinal suffixes in Soft-TFIDF deduping i.e. to count 1st Place and 1 Plce as the same where there might be a misspelling and the phrase wouldn't match under exact expansions

b2dcb18d7e74ddae5e27508e34f0396f7c8027b9 authored almost 7 years ago by Al <[email protected]>
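
A rough illustration of what accounting for a missing ordinal suffix could look like, assuming English-only suffixes and a simple regular expression. The names `ordinal_suffix_len` and `strip_ordinal_suffix` are hypothetical; libpostal's numex helper is language-aware and is not shown here.

```python
import re

# English-only assumption: digits followed by st/nd/rd/th.
# This does not check that the suffix agrees with the number (e.g. "1th" also matches).
_ORDINAL = re.compile(r"^(\d+)(st|nd|rd|th)$", re.IGNORECASE)


def ordinal_suffix_len(token: str) -> int:
    """Return the length of the ordinal suffix, or 0 if the token has none."""
    match = _ORDINAL.match(token)
    return len(match.group(2)) if match else 0


def strip_ordinal_suffix(token: str) -> str:
    """Drop the ordinal suffix so "1st" and "1" compare as the same token."""
    suffix_len = ordinal_suffix_len(token)
    return token[:-suffix_len] if suffix_len else token
```

With the suffix stripped, "1st Place" and "1 Plce" share the token "1", leaving only the misspelled word for the fuzzy comparison to handle.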

[dedupe] adding multi-word phrase alignments to deduping

b03fbdd6816850f185231fcdc507c3b53476b098 authored almost 7 years ago by Al <[email protected]>

[utils] adding utf8 case insensitive comparison

591891951d9d12892ad03655af0edeb9229451ea authored almost 7 years ago by Al <[email protected]>

[similarity] adding a multi-word alignment algorithm for streets and names like "de la cruz" vs. "dela cruz" or "Oceanwalk Ter" vs. "Ocean Walk Ter"

2b4e7073c29d734328a86e4a14952a5a20961173 authored almost 7 years ago by Al <[email protected]>
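
One way to picture a multi-word alignment like the one described above is a two-pointer walk that concatenates tokens on whichever side is shorter until the two spans spell the same string. This is a simplified sketch under that assumption, not libpostal's algorithm; `align_multiword` is a hypothetical name, and it requires exact matches after lowercasing.

```python
def align_multiword(tokens1, tokens2):
    """Return aligned (start, end) span pairs, or None if the sequences do not align.

    Each pair ((s1, e1), (s2, e2)) means tokens1[s1:e1] and tokens2[s2:e2]
    concatenate to the same string.
    """
    t1 = [t.lower() for t in tokens1]
    t2 = [t.lower() for t in tokens2]
    i = j = 0
    spans = []
    while i < len(t1) and j < len(t2):
        start_i, start_j = i, j
        a, b = t1[i], t2[j]
        i += 1
        j += 1
        # Grow the shorter side until both spans match or no growth is possible.
        while a != b:
            if len(a) < len(b) and i < len(t1):
                a += t1[i]
                i += 1
            elif len(b) < len(a) and j < len(t2):
                b += t2[j]
                j += 1
            else:
                return None
        spans.append(((start_i, i), (start_j, j)))
    # Leftover tokens on either side mean the sequences did not fully align.
    if i != len(t1) or j != len(t2):
        return None
    return spans
```

For example, `align_multiword(["de", "la", "cruz"], ["dela", "cruz"])` pairs the first two tokens of the left side with the first token of the right side.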

[normalize/api] exposing normalize_string_languages and normalized_tokens_languages to the API for pre-normalizing numeric expressions at tokenization time

c5bb9d8daa489e289ea0bd74933b3a5c3dd1baf3 authored almost 7 years ago by Al <[email protected]>

[docs][ci skip] adding definition for the street_names.txt dictionary

0edb897143587bdb0fb9028aa12554c02a758a93 authored almost 7 years ago by Al <[email protected]>

[auto][ci skip] Adding data files from Travis build #375

cf2156c0e242f5b7c6b1bba82934eb898e81b38c authored almost 7 years ago by Travis <[email protected]>

[fix] DICTIONARY_STREET_NAME applies only to the street component, unlike synonyms, which would apply to any component. This makes street names a good place to add synonyms found in streets that are not exactly thoroughfare types i.e. could not be removed from the string and have it retain more or less the same meaning.

afd225048ad45154ff74edf4a4a2b2d9d282110d authored almost 7 years ago by Al <[email protected]>

[expand] using street name dictionaries as a possible root component instead of having to pollute the synonyms dictionary, which also affects the parser and might be a better place for general purpose synonyms affecting all components.

0f20613c138c8f6c990e69a47474323127e38a42 authored almost 7 years ago by Al <[email protected]>

[dictionaries] adding new street_names.txt dictionary and moving all the synonyms to there, generating the new dictionary type in address_dictionaries.py

ab67e0864a9f2a1ca2812f1ea89e61c07510a55d authored almost 7 years ago by Al <[email protected]>

[dictionaries] reverting synonyms.txt back to previous commit

cd7bf815226acb7828d0e66323c1f4e50f0d304c authored almost 7 years ago by Al <[email protected]>

[auto][ci skip] Adding data files from Travis build #374

8075ee6f49fbb63006b8f7c040a96debf52c5411 authored almost 7 years ago by Travis <[email protected]>

[dictionaries] removing "Corner" from the synonyms dictionaries for #320

a7b75681594ec6b3eba1e2cd74f797c08e83f5ed authored almost 7 years ago by Al <[email protected]>

[fix] for regular expansion, use gazetteer components or overrides

09408b1075bf4a4d963dee79a819385f4f3389af authored almost 7 years ago by Al <[email protected]>

[fix] no gazetteer changes yet, breaks the parser

c5429de4d7ec0de10dfb26fae23908c28607972a authored almost 7 years ago by Al <[email protected]>

[build] rebuild dictionaries when gazetteer_data.c changes as well

ee2bac66d3d0dce156c14921b4aa760dd4fd8e58 authored almost 7 years ago by Al <[email protected]>

[fix] adding street type gazetteer to name component as well for things like "24th St Cheese Co"

78d621ac854f2dd2959a754a42716a5b9204663f authored almost 7 years ago by Al <[email protected]>

[fix] check expansion address components for regular expansion, overrides for root expansion

9c12a11fd7d455317ca08811c95c1226f9ed6835 authored almost 7 years ago by Al <[email protected]>

[fix] for regular non-root expansion, check that components are valid (for near-dupe expansions or other cases where component options are passed in)

9390e638aeee9f4e07373456cf9d866a697fd4b7 authored almost 7 years ago by Al <[email protected]>

[fix] transliteration case where a context no match comes at the end of the string

2290b0991e3a23d23d993a78932ff497d68df5f8 authored almost 7 years ago by Al <[email protected]>

[fix] check that second double metaphone alternative is not the empty string

156c8bed40b59c6d884f6dfabd53c45d269e9bae authored almost 7 years ago by Al <[email protected]>

[fix] in root expansions, if the current phrase has at least one valid expansion, and the current expansion is not valid, ignore it

3a5c048419eaf7e844bf0e2c0c1a8bfd4c44be5f authored almost 7 years ago by Al <[email protected]>

[fix] max dictionaries is now 5, weird that that wasn't committed by the Travis build

8b96173a32e90d2215c4b4d90583c7d4ecd6badd authored almost 7 years ago by Al <[email protected]>

[auto][ci skip] Adding data files from Travis build #367

a9dded3a1862090835aa3dd2eb754df6bd13f388 authored almost 7 years ago by Travis <[email protected]>

[fix] adding newline to trigger new build

a67df148549a11ea8841dc21a1bbc0856e7c3b8a authored almost 7 years ago by Al <[email protected]>

[fix] changing root expansion test for "E Broadway", which now returns "broadway" instead of "east", which is a better result anyway

86fdaf718829fb34587d246b13083b771f4bae0d authored almost 7 years ago by Al <[email protected]>

Merge branch 'master' of ssh://github.com/openvenues/libpostal

0ee3f6b294b27e14d8ad6005b236b32e96a806ee authored almost 7 years ago by Al <[email protected]>

Merge pull request #309 from antoine-de/small_fix

Fix typo in python imports

a6628e918f9abccf3aab04c49471ecb0518fc130 authored almost 7 years ago by Al Barrentine <[email protected]>

[dictionaries] adding most of the street types to synonyms/place_names in English so e.g. "Spring" in "Spring St" will not be removed in root expansions

e3a252d4eb075f9a5a6bc5fb16dd4495b88af114 authored almost 7 years ago by Al <[email protected]>

[dictionaries] add "a" and "b" to the ambiguous expansions for English, randomly weren't in there

9baac813f63d670f283a2e43cabf875706be4bd8 authored almost 7 years ago by Al <[email protected]>

[dedupe] moving name-only near-dupe hashes to a separate if block so they can be used in conjunction with name+address

4d3619d4934faa5fcfbbda0f5057400be113ccf4 authored almost 7 years ago by Al <[email protected]>

[dedupe] to make soft token similarity order invariant, we swap the order so the shorter token sequence comes first. In the case of a tie, pick the shorter full string length

7cb85aa23c2f6be0b0d5e57a6872c0bf9739ebae authored almost 7 years ago by Al <[email protected]>
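
The ordering rule above can be sketched as a small helper that always returns the "shorter" sequence first. The name `order_for_similarity` is hypothetical, and measuring full string length as the space-joined tokens is an assumption.

```python
def order_for_similarity(tokens1, tokens2):
    """Return the two token lists with the shorter one first.

    Primary key: number of tokens. Tie-break: length of the space-joined string.
    If both keys tie, the input order is kept, so this sketch is only
    order-invariant when the keys differ.
    """
    key1 = (len(tokens1), len(" ".join(tokens1)))
    key2 = (len(tokens2), len(" ".join(tokens2)))
    if key2 < key1:
        return tokens2, tokens1
    return tokens1, tokens2
```

Calling the similarity function on the output of this helper gives the same result regardless of which string the caller passed first, as long as the two keys are not identical.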

[dedupe] with some term weighting schemes (especially information gain which will soon be the default in the lieu project), single letters may have very low weights such that they will be discarded, which can lead to false positives for things like "A & B" vs. "B & C", so add a simple heuristic to simply demote likely dupes to needs review when there's a positive symmetric difference (or whatever the set theory term is for when A - B and B - A are both non-empty)

4aeb54905428f12ff21fa2defba7542e520d9744 authored almost 7 years ago by Al <[email protected]>
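
The heuristic above checks that both set differences are non-empty, which is stricter than a non-empty symmetric difference (that would only require the sets to differ). A minimal sketch, with hypothetical label strings and function name:

```python
def demote_if_both_sides_differ(tokens_a, tokens_b, label):
    """Demote a likely dupe to needs-review when each side has tokens the other lacks."""
    a, b = set(tokens_a), set(tokens_b)
    # Both A - B and B - A must be non-empty; a pure subset relation is left alone.
    if label == "likely_dupe" and (a - b) and (b - a):
        return "needs_review"
    return label
```

For "A & B" vs. "B & C", each side holds a letter the other lacks, so the pair is demoted. For "Park" vs. "Park Pl", only one difference is non-empty and the label is kept.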

[dedupe] in the case of abbreviations and acronyms, where we use the higher of the two scores, calculate an offset to the norm of the other string's scores i.e. since we're replacing the score(s) in the lower-scoring vector with the higher one in the dot product for the numerator, do the same for the L2-norm product in the denominator. This way we don't accidentally inflate the similarity value simply because e.g. an acronym token was more rare than the same acronym spelled out as multiple individual letters (tend to be low-information/common tokens).

af5a5c30397ae939e1e9b45f24440eb5c9eda281 authored almost 7 years ago by Al <[email protected]>

[fix] Damerau-Levenshtein distance costing for transposes was off

13230824a2a22d19261078333148cb6b1c6463d7 authored almost 7 years ago by Al <[email protected]>
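
The commit above does not say what the costing error was. For reference, here is the textbook restricted Damerau-Levenshtein (optimal string alignment) distance, in which an adjacent transposition costs 1, the same as any other single edit. This is a generic sketch and makes no claim about libpostal's implementation.

```python
def osa_distance(s: str, t: str) -> int:
    """Optimal string alignment distance with unit costs for all four edit types."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ab" <-> "ba" counts as one edit.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

Without the transposition case, swapping two adjacent letters would cost 2 (two substitutions), which over-penalizes a very common kind of typo.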

[dedupe] for strict abbreviations (defined as sharing a prefix and a suffix, and containing matches+gaps only by the subtotaling affine gap measure), using the greater of the two scores. This accounts for cases where the abbreviated version may have a much higher weight in one string than the non-abbreviated version does in the other. Same for acronym alignments. Making sure there's a common prefix in regular abbreviation detection. Capping the Soft-TFIDF similarity at 1.0.

d0fe31d359e74746ed1bc7394dafc099a7b7a271 authored almost 7 years ago by Al <[email protected]>

[fix] was missing some shorter tokens that are unicode equal in Soft-TFIDF

b4cc7395a2d192825e81dd0719a0a5720fec1631 authored almost 7 years ago by Al <[email protected]>

[dedupe/similarity] also utilizing the L2 norm in similarity when acronyms are detected. Similarity in this case should be the acronym token's score * the L2 norm of the expanded tokens' scores in the longer string

c4aaee7dbfaf4d4c2eb565d768a9c54b6d52eea6 authored almost 7 years ago by Al <[email protected]>

fix typo in python imports

1c13bf3678f85c3065794db2f93216cff900ceab authored almost 7 years ago by antoine-de <[email protected]>

[optimization] for the FTRL and SGD optimizers, use the new *_array_sum_sq function to do L2 regularization, vs. the L2 norm which will use the linear algebra meaning

ccce4f793f67fd8fd8a5fd82fd15631acce7edbc authored almost 7 years ago by Al <[email protected]>

[similarity/dedupe] normalizing by the product of the L2 norms in soft token similarity function, as in cosine similarity. Score vectors should be passed in unnormalized, and typically with unit length. Also, for aligned phrases that share the same canonical phrase, contribute the product of the two norms of the phrase vectors to the similarity's numerator (maximum value, as if each token in both strings had matched exactly). The previous version over-counted the importance of aligned multi-word phrases by doing a cross product, which could overshadow other more important terms.

eb3fb37ad4f304ce26f1eb45192ae31f231bc05a authored almost 7 years ago by Al <[email protected]>
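
The normalization described above follows the usual cosine form: a dot product of matched token scores divided by the product of the two L2 norms. The sketch below covers only exact token matches and leaves out the fuzzy matching, phrase alignment, and acronym handling that the commits describe. The function names are hypothetical.

```python
import math


def l2_norm(scores):
    """Euclidean norm: square root of the sum of squares."""
    return math.sqrt(sum(s * s for s in scores))


def exact_token_cosine(tokens1, scores1, tokens2, scores2):
    """Cosine-style similarity over tokens that match exactly.

    Assumes each token appears at most once per string.
    """
    weights2 = dict(zip(tokens2, scores2))
    # Numerator: sum of score products for tokens present in both strings.
    dot = sum(s1 * weights2[t] for t, s1 in zip(tokens1, scores1) if t in weights2)
    denom = l2_norm(scores1) * l2_norm(scores2)
    if denom == 0.0:
        return 0.0
    # Cap at 1.0, as a later commit in this log does for Soft-TFIDF.
    return min(dot / denom, 1.0)
```

Because the denominator is computed from the raw scores, the caller passes unnormalized score vectors and the function takes care of the normalization.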

[fix] sqrt in l2 norm

2883b57eb36363412d46b57ba2423b69d7cf9b9d authored almost 7 years ago by Al <[email protected]>

[fix] check for sub-acronyms with no stopwords in near-dupe hashing

3c5713ef59517f7f9f4a0aa0941cc643c1079b9f authored almost 7 years ago by Al <[email protected]>

[fix] initialize repeat_state_end in transliteration. Couldn't reproduce the conditional jumps in #308, but this seems to be where they're occurring, and it's generally good practice to initialize things

fa40a7e87c6b9992fa85e2435b0c61f7e701f280 authored almost 7 years ago by Al <[email protected]>

Merge branch 'master' of ssh://github.com/openvenues/libpostal

e4c35a09119b5a408a5dbc06638d22fbd33d03bf authored almost 7 years ago by Al <[email protected]>

[fix] expansion test valgrind errors for #308

dc8bffd5a0f7c91946d4123e1d4d6b3597166eb7 authored almost 7 years ago by Al <[email protected]>

[docs][ci skip] Adding issue template

48ea0d780158c4cbe4acdd6e7ce74271d5e710a7 authored almost 7 years ago by Al Barrentine <[email protected]>

[fix] load transliteration for language classifier cli for #302

984235e87960404123d6e0117248fb81a273aa41 authored almost 7 years ago by Al <[email protected]>

[dedupe] fixing sub-acronym near-dupe hashes with punctuation, and making sure to add the current token after a new sub-acronym has been cut

7121642c6229d90bd2f8c3f98ec15be0039d20e5 authored almost 7 years ago by Al <[email protected]>

[fix] case-insensitive comparison of content-length header in data download script

95e483e3cad1ca7795159c88187e4f1590bc04b1 authored almost 7 years ago by Al <[email protected]>

Merge pull request #303 from Maurice-Betzel/master

Added unofficial Java language binding

3e3558e1993f0ab5964019c52c50272142a51a26 authored almost 7 years ago by Al Barrentine <[email protected]>

Added unofficial Java language binding

46585c89e0607985e99eabb8c33b6e7038e7be9e authored almost 7 years ago by Maurice Betzel <[email protected]>

Added unofficially supported Java language binding

ebc332a3117d5e2047bc5403ad8365e653b16f9b authored almost 7 years ago by Maurice Betzel <[email protected]>

[auto][ci skip] Adding data files from Travis build #343

a4793f0f7934252c62b8e31212410573f3290d0a authored almost 7 years ago by Travis <[email protected]>

[dictionaries] adding BYP to ambiguous expansions for the Black Youth Project

ac350d90f69448b91d8c2d9262e192b1dc3ac130 authored almost 7 years ago by Al <[email protected]>

[dedupe] adding a near-dupe hash which takes into account existing acronyms which may have appeared in the string, either known acronyms as defined in the dictionaries like "HS" and includes the full token in the acronym. This feature is particularly useful for public schools or other cases where the canonical string may be used i.e. "Foo High School", "Foo HS" and "FHS". It also does the same thing for other acronyms that are identified by the tokenizer from the internal period structure like A.B.C. Also now allowing mixed alpha-numeric tokens to use the double metaphone encoding as well, and for numeric tokens with script=Common (digits but may also contain hyphens, etc.), the full token is included as one of the words rather than quadgrams, which don't make sense for numerics.

03e5e25240a1f6f023d77c7c251eb437a4e8a967 authored almost 7 years ago by Al <[email protected]>

[expand] for root expansions, delete ambiguous tokens only when there's a non-numeric non-phrase token present. This applies to all name components, not for components where numerics can be the root (house numbers, units, streets, etc.)

0286a2fef3b51d62dbdb4a4fa8656a73a1e8f9bc authored almost 7 years ago by Al <[email protected]>

[dedupe] adding a function to acronyms module to detect existing/known acronyms like MS for middle school, HS for high school, etc. Forms like MS have to be defined in the dictionaries specifically but any acronym written like M.S. will be detected as such by the tokenizer

0ee18b4f6c614ca927ce837f1f09b0f180987077 authored almost 7 years ago by Al <[email protected]>

Merge branch 'master' of ssh://github.com/openvenues/libpostal

133381f4396eac0b5588b415354a8fc90b396cf0 authored almost 7 years ago by Al <[email protected]>

[dedupe] using 4-grams with no edge disambiguation in near dupe hashing of names instead of full tokens (uses the double metaphone for Latin script, 2-grams for ideographic scripts and 4-gram unicode chars for other scripts like Arabic or Cyrillic). The fully concatenated name string with no whitespace + acronyms/subacronyms now also use double-metaphone in Latin script, and are split into 4-grams. Overall this reduces the number of keys, accounts for more misspellings as well as languages with longer words such as German, and various spacing/concatenations differences in general, while still being relatively selective. Most words in Latin scripts will resolve to less than 4 characters, so this mostly affects longer words with many consonants. 4-gram blocking tends to be what's used in the literature, and works well in practice on human and venue names. This is a slight departure from said literature in that we use 4-grams of the phonetic normalization for Latin scripts.

c553fe81eee1dfd424c39b14644040b009b365f0 authored almost 7 years ago by Al <[email protected]>
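
To illustrate the 4-gram step on its own, here is a sketch of character n-grams without edge markers (no padding characters at the start or end). The double metaphone encoding that the commit applies first for Latin script is deliberately left out, and the decision to keep strings of length n or less whole is an assumption. `char_ngrams` is a hypothetical name.

```python
def char_ngrams(s: str, n: int = 4):
    """Return overlapping character n-grams with no start/end padding.

    Strings of length n or less are returned whole as a single key.
    """
    if len(s) <= n:
        return [s]
    # Python strings index by code point, so this also works for non-Latin scripts.
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

Two names only need to share one n-gram to land in a common block, which is what lets this tolerate misspellings and spacing differences elsewhere in the string.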

[auto][ci skip] Adding data files from Travis build #341

c58b16e745a89ed2fd223a2f754bdb14b6bdb4de authored almost 7 years ago by Travis <[email protected]>

Merge pull request #299 from antimirov/master

Updating Ukrainian dictionaries, small fixes and additional variants.

b2aa2e4fddf4b837895979ec8d36a980c4f51cd0 authored almost 7 years ago by Al Barrentine <[email protected]>

[fix] logic in sub-acronym generation for near-dupe hashes

f5e41a1f57313121bad78114d4a5c60433eb8195 authored almost 7 years ago by Al <[email protected]>

[dedupe] adding a near-dupe hash for acronyms both with and without stopwords. This will create basic acronyms for institutions like MoMA, UCLA, the NAACP, as well as human initials, etc. It also handles sub-acronyms: at every other non-contiguous stopword (University of Texas at Austin) or at punctuation (University of Texas, Austin), it cuts a new sub-acronym (so UT). All of the acronyms for Latin script use a double metaphone as well, so can potentially catch many cases. It does not handle all possible acronyms (e.g. where some of the letters are word-internal as in medical acronyms), but should do relatively well on many common variations.

6ba0403748a396b74a04cd36d3cacf1e09ce1dc1 authored almost 7 years ago by Al <[email protected]>
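
A simplified sketch of the "with and without stopwords" part of the commit above. The stopword list is a small hypothetical stand-in for libpostal's dictionaries, and sub-acronym cutting and the double metaphone step are not implemented.

```python
# Hypothetical English stopword list, for illustration only.
STOPWORDS = {"of", "the", "at", "and", "for"}


def basic_acronyms(name: str):
    """Return (acronym_with_stopwords, acronym_without_stopwords), uppercased."""
    tokens = name.split()
    with_stopwords = "".join(t[0] for t in tokens).upper()
    without_stopwords = "".join(
        t[0] for t in tokens if t.lower() not in STOPWORDS
    ).upper()
    return with_stopwords, without_stopwords
```

Emitting both forms as hash keys lets "Museum of Modern Art" block together with either "MoMA" or "MMA" once case is normalized.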

Updating Ukrainian dictionaries, small fixes and additional variants.

b1f3760a81a8b41563711f93ff086f9a273f601d authored almost 7 years ago by Yevgen Antymyrov <[email protected]>

[expand] adding another check in root expansions, making sure we don't ignore the unmodified ambiguous phrase

c29557c16bcbc0d74ecc54f4580231c51e2d8a3c authored almost 7 years ago by Al <[email protected]>

[dedupe] adding a near-dupe hash for the entire name without spaces.

e6edf54adb944d2504415f7e891c3f52fd035f41 authored almost 7 years ago by Al <[email protected]>

[expand] make street type dictionaries ignorable for venue names as well (many company names mention their address, so sort of have to apply the same rules)

66aee0fffa48dcf31c5e96c1a4a8947d004bc046 authored almost 7 years ago by Al <[email protected]>

[fix] need to calculate max Jaro-Winkler for other methods, so only test whether we should use it after we've cycled through all the tokens

e935f2a036beb21403563fa8e0052a5fd15ee4bb authored almost 7 years ago by Al <[email protected]>

[dedupe] for fuzzy street duplicates, using the likely dupe classification regardless of similarity if one set of tokens is fuzzily-contained within the other (the set of matches allows for words matching Jaro-Winkler/Levenshtein similarity, being out of order, etc. but not acronyms or the more flexible abbreviation detection, just abbreviations out of libpostal's dictionaries provided that the abbreviations share the same canonical phrase with the aligned phrase in the other string)

179e6581e596902462beb6d31f53fbadcbf49450 authored almost 7 years ago by Al <[email protected]>

[similarity] adding a match count in Soft-TFIDF to allow answering questions about subsets i.e. the set of tokens in "Park Pl" contain the set of tokens in "Park". Setting Jaro-Winkler minimum length of 4 chars on, more specific option name for possible abbreviation detection

4356174630fb85b59b69a5c22972d4078e7cadd9 authored almost 7 years ago by Al <[email protected]>

[test] different test case for expansion, male names are overemphasized as it is

0cb488eceac6d247526d3ca545c852e5fc7586f1 authored almost 7 years ago by Al <[email protected]>

Merge pull request #297 from oschwald/greg/fix-more-leaks

Fix several more memory leaks and a segfault

513fdd775f25bc665c25ce6777799ab6cafc0f9f authored almost 7 years ago by Al Barrentine <[email protected]>

Fix segfault in expand_alternative_phrase_option

string_tree_get_alternative can return NULL

2f6749fe039c658a4bb954a97c723460cfa432d8 authored almost 7 years ago by Gregory Oschwald <[email protected]>

Only create parser response when it is used

Previously, an unused response would not be freed, causing a leak.

18cc0e37e644c85f2833c4be68d6107db310bdfb authored almost 7 years ago by Gregory Oschwald <[email protected]>

Fix memory leaks in test_trie

The primary motivation is to make the test suite run clean under
Valgrind so that we don't need ...

95ea873498823785eef964bfc9db452188b454dd authored almost 7 years ago by Gregory Oschwald <[email protected]>

Add missing char_array_destroy when numex is invalid

999de2bf6a634fd2cb71437c4c18e0fdb65bfa40 authored almost 7 years ago by Gregory Oschwald <[email protected]>

[test] adding Suite/Ste tests for root expansion bugfix

d33b6693b9fe603f3f93ecd8f2a2c2f6946b98bf authored almost 7 years ago by Al <[email protected]>

[fix] in root expansions, removing phrases that are invalid for the given components if there are other ignorable components

071aee0e853d1577dba2d89775e138a3e4d2e5aa authored almost 7 years ago by Al <[email protected]>

Merge branch 'master' of ssh://github.com/openvenues/libpostal

d8a0a344cd5d9a01fbbb437007a44bb3387a94fb authored almost 7 years ago by Al <[email protected]>

[fix] fixing a couple of warnings in dedupe/near_dupe

7651a7b9b9465e2b12aac41d7a2703141a88c6da authored almost 7 years ago by Al <[email protected]>

[auto][ci skip] Adding data files from Travis build #330

95ea250cb142525257b2f5ef06015711de16e309 authored almost 7 years ago by Travis <[email protected]>

Merge pull request #294 from openvenues/lieu_api

Near-duplicate detection and address deduping

8a917d8594eccc90d773aa180593f23edcdd538a authored almost 7 years ago by Al Barrentine <[email protected]>

[similarity] max out the Jaro-Winkler shared prefix at 4 characters in accordance with Winkler's paper

3bdb8c86306a9b155f934b20b1168f8631f3ccba authored almost 7 years ago by Al <[email protected]>
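
For context on the prefix cap above, here is a textbook Jaro-Winkler sketch with the shared prefix limited to 4 characters and the conventional scaling factor of 0.1. It is a generic reference version, not libpostal's code. The function names are hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2.0
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def capped_prefix_len(s1: str, s2: str, max_prefix: int = 4) -> int:
    """Length of the common prefix, capped at max_prefix characters."""
    n = 0
    for a, b in zip(s1, s2):
        if a != b or n == max_prefix:
            break
        n += 1
    return n


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted by the capped common prefix."""
    j = jaro(s1, s2)
    return j + capped_prefix_len(s1, s2) * scaling * (1.0 - j)
```

With a scaling factor of 0.1, capping the prefix at 4 keeps the boost to at most 40% of the remaining distance to 1.0, so long shared prefixes cannot dominate the score.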

[dedupe] fixing toponym matching for city-equivalents, adding the LIBPOSTAL_ADDRESS_ANY component in each function call so it can be removed as needed.

4e325657469227b854c4552984ae7f999df18fa8 authored almost 7 years ago by Al <[email protected]>

[fix] update to struct

34c3ee7f7a1e2d41a1282eb94f33e9ce929daaf6 authored almost 7 years ago by Al <[email protected]>

[expand] adding a few of the address phrase checks to the expand header

34fe7ec305b0c33af0fd75719d40134e084dbab0 authored almost 7 years ago by Al <[email protected]>

[dedupe/test] checking for NULL in near_dupe test program

668e46796797290d7b94a508753d0cb8cfca6df8 authored almost 7 years ago by Al <[email protected]>

[api] using uint32_t for geohash precision option

3263c84b321cccf6964a2508fc7e3148dc1db354 authored almost 7 years ago by Al <[email protected]>

[fix] removing unused vars

434bbd4dc28c4696d6842dfbf1fdfeb5e8dad517 authored almost 7 years ago by Al <[email protected]>

[api] checking for NULL responses in the cstring_array methods before converting them to char arrays

86d5eca521a0f641e035a686f6bdf9d4bf83e755 authored almost 7 years ago by Al <[email protected]>

[dedupe] fixes to near dupe hashing, geohash lengths, cutting off name hashing at 50 unique tokens, fixing memory leaks, checking for valid geo components and returning NULL if one of the required fields isn't present

c48c2b778c0b6fccabdf42989a324b4af5819126 authored almost 7 years ago by Al <[email protected]>

[api] adding APIs for getting default options and using a consistent naming convention

6dff154a99ae0c03e5a032697d21c079c3eb8cd2 authored almost 7 years ago by Al <[email protected]>

[build] adding new source files to Makefile for the lieu APIs

53543be5a5e477e33a5e885db6ebf53c56e2188b authored almost 7 years ago by Al <[email protected]>

[api] adding pairwise-dupe functions/structs to the public header

8495cda1eb8b2534454dce3741220f39bb5f4844 authored almost 7 years ago by Al <[email protected]>

[fix] making a few internal functions static

cadf52d19fb9d53edc53b256383e71609574f042 authored almost 7 years ago by Al <[email protected]>