Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/openvenues/libpostal

A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
https://github.com/openvenues/libpostal

[openaddresses] Ignoring cities starting with UT in St Louis County, MN

979f866678586f5ee5649a83b5f6db9008f77fdb authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Mexico countrywide, removing add_osm_boundaries from New Orleans

8980131f3fbf68f5bb4884625b9153508a1bf06e authored almost 8 years ago by Al <[email protected]>
[fix] add CBLAS_LIBS in the test Makefile

65fadbeea32068908103f77122ffb0355c6fcdee authored almost 8 years ago by Al <[email protected]>
[fix] removing WIP

f7889bf1380aef5725aa1e4af1bf639aa58e3357 authored almost 8 years ago by Al <[email protected]>
[fix] HAVE_CBLAS in matrix.h, memcpy needs to use sizeof(type)

d757baaf56ed7dc7b0ed4a9bc6f9a8e5fcd87842 authored almost 8 years ago by Al <[email protected]>
[build] trying a CBLAS-specific macro that doesn't rope in Fortran

a7c9b919e93bc313f11a85a97b0b7c671f983cee authored almost 8 years ago by Al <[email protected]>
[fix] typo

9636ef6393359659116b113ba12ac0d0b92269d5 authored almost 8 years ago by Al <[email protected]>
[places] allowing training examples in the US and Canada with no city 5% of the time so the road=>{county,state} transition is more likely

3e051a30da9134ac4fa629d84c746273c3cf46c1 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding units to Denver

b90a703746021bd26dad43f564448a98a3543390 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Piemonte

f0d37cc56d0bd4c6a8eb29cc369406c4b0611b03 authored almost 8 years ago by Al <[email protected]>
[test] adding Brazil and Romania parses for demo

0bd1bdb6f2caeffedac56373b35eb24e3b7cbb52 authored almost 8 years ago by Al <[email protected]>
[test] adding US tests for parser demo queries

03ceb18a4189eb43724de2a65345d0e882b4cbe8 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Belmont County, OH

22d97a0a35e9cd8b8fbdc08f1e8f29b4b78a5059 authored almost 8 years ago by Al <[email protected]>
[openaddresses] add McKinney, TX

5ac891c4844c34f49e6c93edb7107e08483e4381 authored almost 8 years ago by Al <[email protected]>
[dictionaries] adding Dep. as an abbreviation for departamento in Spanish

40f594e3be46a0e3f9e1b2159e5a711f5d844a89 authored almost 8 years ago by Al <[email protected]>
[openaddresses] Sibley County, MN

c0bded412c0954d7ef11376aa88bff1e7d9fded1 authored almost 8 years ago by Al <[email protected]>
[addresses] adding the ability to hyphenate the generated unit/floor numbers, either for ranges or simple hyphenated numbers, including hyphenated variants of the letter + number or number + letter forms. Implementing for English but something similar can be done in the other configs.

217de3a8a214dab3f731dd555109b12fbc4f7f60 authored almost 8 years ago by Al <[email protected]>
[addresses] allowing number/ordinal spellout in the Trappa/Trappor Upp syntax in Swedish, didn't make it into the release

56f00250c2c24eef4290f9d15db48d332e5cba66 authored almost 8 years ago by Al <[email protected]>
[test] making some of the test cases simpler/easier so they don't fail. In general this should just be for examples that are/are going to be in the docs. Improving overall aggregate statistics like held-out accuracy over time is preferable to worrying about one individual test failure.

61d008f3496a564b838b5f3279d073c9632bc52d authored almost 8 years ago by Al <[email protected]>
[countries] use ISO 3166 country name 5% of the time for general addresses, 10% of the time for OpenAddresses. Gives the parser examples of names like "Korea, Republic of" in #168

81c59e116a73144d880ba5a51d2c6856f9e5dd95 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Korea countrywide dataset

ecfa6855e72b180ac579684e1ceb993a2b19069c authored almost 8 years ago by Al <[email protected]>
[formatting] adding postcode before city insertion for former USSR countries

8e4b909013321f950fd249fa57bd361688596440 authored almost 8 years ago by Al <[email protected]>
[places] increase state_district probability in India

9fccfa0997adbe5c0f4c54cae8102ef1d1b63174 authored almost 8 years ago by Al <[email protected]>
[test] add LaSalle, Montréal tests

3aaa628b2565d310c5ba64eb680a11ab3c85f23c authored almost 8 years ago by Al <[email protected]>
[test] adding a number of user-contributed test cases from Moz in #21. Almost all are working under the CRF parser trained on 10% of the data. There are a few problematic ones in the UK still that have been omitted here. We currently don't correctly format the training data for the locality + postal town pattern, which are both considered "city" by libpostal and thus one will usually get lumped in with the road or something like that. There may also be some utility in modelling comma usage (training data has commas, but they're ignored by the parser both at train and run time; it might be useful to train on them but drop them out randomly so the parser doesn't become too dependent on having them)

1f1dbe25e1621aff3283cf3a13370af4a47d5cd4 authored almost 8 years ago by Al <[email protected]>
[matrix/utils] adding resize_fill_zeros

7fe84e6247a564604f2f932a7e2cb3e854298383 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Sicily statewide

2bda741fa9abc6139741e70165445e8f3299b1ef authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Novosibirsk Oblast, Russia

67805047f4584c336692f8c2c8eae05fe4412f77 authored almost 8 years ago by Al <[email protected]>
[test] adding parser test cases in 22 countries. These may change, and I'm generally against putting every obscure test case in the world in here. It's better to measure accuracy in aggregate statistics instead of individual test cases (i.e. if a particular change to the parser improves overall performance but fails one test case, should we accept the improvement?) The thought here is: these represent parses that are used in documentation/examples, as well as most of those that have been brought up in GitHub issues from the initial release, and we want these specific tests to work from build to build. If a model fails one of these test cases, it shouldn't get pushed to our users.

b8a12e05172f00fa8357345dc67c8776d0e54655 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Chesterfield, SC

7218ca131694097ff6fd2010a768d1ce4e32d8d4 authored almost 8 years ago by Al <[email protected]>
[fix] handle multiple separators (like parens used in https://www.openstreetmap.org/node/244081449). Creates bad trie entries otherwise, which affect more than just that toponym

3b9b43f1b5251b993900d3c94aa1cbf9c14d5c07 authored almost 8 years ago by Al <[email protected]>
[parser] using a bipartite graph (indptr + indices) to represent postal code<=>admin relationships instead of a set of 64-bit ints. Requires |V(postal codes)| + |E| 32 bit ints instead of |E| 64 bit ints. Saves several hundred MB in file size and even more space in memory because of the hashtable overhead

c67678087fa47c661a8a9ae6fbdf9b807a5c4247 authored almost 8 years ago by Al <[email protected]>
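
The indptr + indices layout described in the commit above is the same idea as compressed sparse row (CSR) storage. The following is a minimal sketch of that layout, not libpostal's actual code: the type `postal_admin_graph_t` and all function names are hypothetical, and it assumes the edges arrive sorted by postal code id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical CSR-style bipartite graph: postal code i is linked to the
 * admin ids stored in indices[indptr[i]] .. indices[indptr[i + 1] - 1]. */
typedef struct {
    uint32_t num_postal_codes;
    uint32_t *indptr;   /* num_postal_codes + 1 row offsets */
    uint32_t *indices;  /* one admin id per edge, grouped by postal code */
} postal_admin_graph_t;

void postal_admin_graph_destroy(postal_admin_graph_t *g) {
    if (g == NULL) return;
    free(g->indptr);
    free(g->indices);
    free(g);
}

/* Build from num_edges pairs (postal_ids[k], admin_ids[k]).
 * Assumes the pairs are sorted by postal id. Returns NULL on allocation failure. */
postal_admin_graph_t *postal_admin_graph_new(uint32_t num_postal_codes,
                                             const uint32_t *postal_ids,
                                             const uint32_t *admin_ids,
                                             uint32_t num_edges) {
    postal_admin_graph_t *g = calloc(1, sizeof(*g));
    if (g == NULL) return NULL;
    g->num_postal_codes = num_postal_codes;
    g->indptr = calloc((size_t)num_postal_codes + 1, sizeof(uint32_t));
    g->indices = malloc(((size_t)num_edges + 1) * sizeof(uint32_t));
    if (g->indptr == NULL || g->indices == NULL) {
        postal_admin_graph_destroy(g);
        return NULL;
    }
    for (uint32_t k = 0; k < num_edges; k++) {
        assert(postal_ids[k] < num_postal_codes);
        assert(k == 0 || postal_ids[k - 1] <= postal_ids[k]);
        g->indptr[postal_ids[k] + 1]++;   /* count edges per postal code */
        g->indices[k] = admin_ids[k];     /* sorted input is already grouped */
    }
    /* Prefix sum turns the per-row counts into row start offsets. */
    for (uint32_t i = 0; i < num_postal_codes; i++) {
        g->indptr[i + 1] += g->indptr[i];
    }
    return g;
}

/* Linear scan over one postal code's neighbors. */
bool postal_admin_graph_has_edge(const postal_admin_graph_t *g,
                                 uint32_t postal_id, uint32_t admin_id) {
    if (postal_id >= g->num_postal_codes) return false;
    for (uint32_t k = g->indptr[postal_id]; k < g->indptr[postal_id + 1]; k++) {
        if (g->indices[k] == admin_id) return true;
    }
    return false;
}
```

Each edge costs one 32-bit entry in `indices`, and each postal code costs one 32-bit entry in `indptr`, which matches the |V(postal codes)| + |E| count given in the commit message.
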
[transliteration] regenerate transliteration data

cb112f0ea76b6c7405dc6642d6c7c77e498ea713 authored almost 8 years ago by Al <[email protected]>
[fix] with the new CLDR transform format, reverse the lines rather than the nodes in reverse transliterators

579425049b8893704c48a62084a969489fab42af authored almost 8 years ago by Al <[email protected]>
[test] adding test of new latin-ascii-simple transliterator which only handles things like HTML entities

8e3c9d02693365b2afc9eb038ac3d40c2283154d authored almost 8 years ago by Al <[email protected]>
[test] adding printfs on expansion test failure so it's more clear what's going on

be07bfe35d91975c6e04e895d74191bc937f83fd authored almost 8 years ago by Al <[email protected]>
[phrases] set node data only when we're sure we have a correct match, otherwise the longer phrase may actually be matched

dfabd25e5dac95b44c0891c531aee87df987b76e authored almost 8 years ago by Al <[email protected]>
[fix] don't compare a double to 0

f4a9e9d673e612ba182f6eca7938601ad36edfd2 authored almost 8 years ago by Al <[email protected]>
[fix] need to store stats for component phrases that have more than one component, otherwise only the first gets stored and everything is an "unambiguous" phrase, which is not true

266065f22f88b25e06794c51f03035e0b1bcef43 authored almost 8 years ago by Al <[email protected]>
[parser] thought numeric boundary names had already been removed in the source data, but somehow they've made it into one of the data sets. Doing a final check in context_fill for valid boundary names (currently valid if there's at least one non-digit token)

0b27eb3f74e0531e8fc26b9bfeeac209f8170151 authored almost 8 years ago by Al <[email protected]>
[utils] adding string_is_digit function, similar to Python's (i.e. counts if it's in the Nd unicode category)

1b2696b3b5703bb38d4e8165abb5854bd170cfb1 authored almost 8 years ago by Al <[email protected]>
[parser] parser only inserts spaces in the output if there were spaces (or other ignorable tokens) in the normalized input

1a1f0a44d2e6d1ce6fc28b872d57eef2442a2d2e authored almost 8 years ago by Al <[email protected]>
[fix] log_sum_exp in SSE mode shouldn't modify the original array

d43989cf1cdcad5a840aaa606afb5b3496e85127 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding some of the new counties in GA. Adding the simple unit regex to DeKalb county's ignore list as there are a few in there

c201939f3ab86fb3f6c4a4e3466e3b97ac193277 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding language-delineated files for South Tyrol

e0a9171c09a9c3f992870143cc6763322050d17e authored almost 8 years ago by Al <[email protected]>
[fix] handle case of T = 0 in Viterbi decoding

6cf113b1df2528cc16e2a436ccb1db77fa61c66a authored almost 8 years ago by Al <[email protected]>
[fix] move

35ccb3ee62a5ac81498ee9e1e64928c39645f2a7 authored almost 8 years ago by Al <[email protected]>
[fix] heap issues when cleaning up CRF

d40a355d8bd4cc54a08e7eecef2eff177cf6190b authored almost 8 years ago by Al <[email protected]>
[logging] some small logging changes to track vocab pre/post pruning

1277f82f52623d87e3524081e06cb6af299e8856 authored almost 8 years ago by Al <[email protected]>
[test] adding the new tests to the Makefile

7afba832e594725138d0dc8ed9eb94eefe33d7e6 authored almost 8 years ago by Al <[email protected]>
[crf] in averaged perceptron training for the CRF, need to update transition features when either guess != truth or prev_guess != prev_truth

7562cf866bdabd52eaa9fbdb2cea4c1fe3adc984 authored almost 8 years ago by Al <[email protected]>
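
The update condition in the commit above can be sketched as follows. This is an illustrative, unaveraged perceptron step, not the library's implementation: the function name and the dense row-major `trans` layout are assumptions, and the averaging bookkeeping is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transition update for one training sequence of length T.
 * trans is a dense num_labels x num_labels matrix, trans[prev * num_labels + cur].
 * A transition is touched when either end of it was predicted wrongly. */
void update_transition_weights(double *trans, size_t num_labels,
                               const uint32_t *truth, const uint32_t *guess,
                               size_t T) {
    for (size_t t = 1; t < T; t++) {
        assert(truth[t] < num_labels && truth[t - 1] < num_labels);
        assert(guess[t] < num_labels && guess[t - 1] < num_labels);
        if (guess[t] != truth[t] || guess[t - 1] != truth[t - 1]) {
            trans[truth[t - 1] * num_labels + truth[t]] += 1.0;  /* reward gold transition */
            trans[guess[t - 1] * num_labels + guess[t]] -= 1.0;  /* penalize predicted transition */
        }
    }
}
```

Checking only `guess[t] != truth[t]` would miss the case where the current label is right but was reached from a wrong previous label, so the transition feature that fired would never be corrected.
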
[fix] had taken out a previous optimization while debugging. Don't need to repeatedly update the backpointer array in viterbi to store an argmax when a stack variable will work. Because that's in the quadratic (only in L, the number of labels, which is small) section of the algorithm, even this small change can make a pretty sizeable difference. CRF training speed is now roughly on par with the greedy model

a6eaf5ebc57fbca025eb4f0908a0084523f1e0d4 authored almost 8 years ago by Al <[email protected]>
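
As a rough illustration of the optimization described above, here is a small Viterbi decoder that tracks the argmax in a local variable and writes the backpointer once per (t, label) pair. It is a sketch under assumptions: the function name, the dense score layouts, and the return convention are hypothetical and not taken from the library. It also returns early for T = 0, the case mentioned in an earlier fix in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* state: T x L scores, state[t * L + j].
 * trans: L x L scores, trans[i * L + j] is the score of moving from i to j.
 * labels: output buffer of length T. Returns false only on allocation failure. */
bool viterbi_decode(const double *state, const double *trans,
                    size_t T, size_t L, uint32_t *labels) {
    if (T == 0) return true;  /* empty sequence: nothing to decode */
    assert(L > 0);

    double *alpha = malloc(T * L * sizeof(double));
    uint32_t *back = malloc(T * L * sizeof(uint32_t));
    if (alpha == NULL || back == NULL) {
        free(alpha);
        free(back);
        return false;
    }

    for (size_t j = 0; j < L; j++) alpha[j] = state[j];

    for (size_t t = 1; t < T; t++) {
        const double *prev = alpha + (t - 1) * L;
        for (size_t j = 0; j < L; j++) {
            /* Keep the running max and argmax in locals inside the O(L^2) loop. */
            double best = prev[0] + trans[j];
            uint32_t argmax = 0;
            for (size_t i = 1; i < L; i++) {
                double score = prev[i] + trans[i * L + j];
                if (score > best) {
                    best = score;
                    argmax = (uint32_t)i;
                }
            }
            alpha[t * L + j] = best + state[t * L + j];
            back[t * L + j] = argmax;  /* single store per (t, j) */
        }
    }

    /* Pick the best final label, then follow backpointers to the start. */
    uint32_t last = 0;
    for (size_t j = 1; j < L; j++) {
        if (alpha[(T - 1) * L + j] > alpha[(T - 1) * L + last]) last = (uint32_t)j;
    }
    labels[T - 1] = last;
    for (size_t t = T - 1; t > 0; t--) {
        labels[t - 1] = back[t * L + labels[t]];
    }

    free(alpha);
    free(back);
    return true;
}
```
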
[fix] formatting for print features in CRF model

647ddf171d7113366b6b356a01aedd64d6c584e0 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Douglas County, OR

735fd7a6b7d5c41a4c8972c98084d93a7705e2a8 authored almost 8 years ago by Al <[email protected]>
[fix] add CRF files to the main lib

d876beb38601f689e07f642f7a50d487e5b2a03b authored almost 8 years ago by Al <[email protected]>
[build] adding necessary sources to address_parser client, address_parser_train and address_parser_test

0ec590916b6063fbfa73ca36cf2c7a53f359e1c7 authored almost 8 years ago by Al <[email protected]>
[utils] new_fixed and resize_fixed in vector.h

25649f2122c60bb913a346f79f68fd2bb4ae517b authored almost 8 years ago by Al <[email protected]>
[utils] adding file_exists to header

4e02a54a79d867387891bab28196e48e51ded3bb authored almost 8 years ago by Al <[email protected]>
[parser/cli] removing geodb loading from parser client

5775e3d8063b5ad6a956516fe080cd72a3901476 authored almost 8 years ago by Al <[email protected]>
[parser] adding polymorphic (as much as C does polymorphism) model type for the parser to allow it to handle either the greedy averaged perceptron or a CRF. During training, saving, and loading, we use a different filename for a parser trained with a CRF, which is still backward-compatible with models previously trained in parser-data. Making necessary modifications to address_parser.c, address_parser_train.c, and address_parser_test.c. Also adding an option in address_parser_test to print individual errors in addition to the confusion matrix.

8deb1716cbb9aebd90f4e28fdb89c665489d5f8f authored almost 8 years ago by Al <[email protected]>
[openaddresses] add Gillespie County, TX

1bd4689c5f0af3d711413dd84477b97135ca32d0 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Fisher County, TX

171aa77ea3af8b402309c6d504bb7f3ac25e3a23 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Coffey County, KS

8e3bcbfc959e4081b7b2664783ed64a36dfcbd5c authored almost 8 years ago by Al <[email protected]>
[utils] adding a function for checking if a file exists (yay C), or at least the closest agreed-upon method for it (may return false if the user doesn't have permissions, but that's ok for our purposes here)

b85ed706741d8325c78a1a4d85e52b6dd580a683 authored almost 8 years ago by Al <[email protected]>
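
One common way to write such a check in portable C is to try opening the file for reading. The sketch below assumes that approach; the commit does not say which method was used, and the name `file_exists_sketch` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true if the path can be opened for reading.
 * As the commit message notes for its own helper, this kind of check
 * reports false for a file that exists but is not readable by the caller. */
bool file_exists_sketch(const char *path) {
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL) return false;
    fclose(f);
    return true;
}
```
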
[cli] no longer need geodb setup in address parser client

3b33325c1a9437b01e165d0c36e8cb109a32f2ee authored almost 8 years ago by Al <[email protected]>
[parser/crf] adding runtime CRF tagger, which can be loaded/used once trained. Currently only does Viterbi inference, can add top-N and/or sequence probabilities later

ef8768281b7c35621ea6e39c40b69c77401ab984 authored almost 8 years ago by Al <[email protected]>
[parser/crf] adding an initial training algorithm for CRFs, the averaged perceptron (FTW!)

Though it does not generate scores suitable for use as probabilities, and might...

9afff5c9ed6e8367bc6d31fadb43b68686e9bffa authored almost 8 years ago by Al <[email protected]>
[parser/crf] adding crf_trainer, which can be thought of as a "base class" as much as that's possible in C, for creating trainers for the CRF. It doesn't deal with the weights or their representation, just provides an interface for keeping track of string features and label names, and holds the crf_context

5cac4a7585ced66d9e01dcebebad964b52845547 authored almost 8 years ago by Al <[email protected]>
[test/utils] also a good thing to sanity check (in C especially): string handling code

dd0bead63a698486e91640a8a0c109462b6342aa authored almost 8 years ago by Al <[email protected]>
[test/crf] test for crf_context, adapted from crf1dc_debug_context in CRFsuite. Always a good idea to sanity check numerical code

adab8ab51ad8e6c73bd6327e9a0109f11fa0945c authored almost 8 years ago by Al <[email protected]>
[parser/crf] adding the beginnings of a linear-chain Conditional Random Field implementation for the address parser.

One of the main issues with the greedy averaged perceptro...

f9a9dc22241bc3b26cf78afc63d1ecef8755e600 authored almost 8 years ago by Al <[email protected]>
[parser] size the postcode context set appropriately when reading the parser, makes loading a large model much faster

f9e60b13f5d04e418b20b2b3cd5176dfb14cfd33 authored almost 8 years ago by Al <[email protected]>
[fix] fixing up hash str to id template

24001221628d9e060a25c3cefe191860f305552d authored almost 8 years ago by Al <[email protected]>
[parser] for the min updates method to work, the features that have not yet reached the min_updates threshold also need to be ignored when scoring, that way the model has to perform without those features, and should make more updates if they're relevant

4c03e563e045a16fd96db531828ca6220c983502 authored almost 8 years ago by Al <[email protected]>
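
The scoring rule in the commit above might look roughly like this. It is a sketch only: the function name, the dense `weights` matrix, and the `update_counts` array are assumptions made for illustration, while `min_updates` is the threshold named in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* weights: num_features x num_classes, weights[f * num_classes + c].
 * update_counts[f]: how many times feature f has been updated so far.
 * feature_ids: the features active for the current token.
 * Features below the min_updates threshold contribute nothing to the scores. */
void score_with_min_updates(const double *weights, const uint32_t *update_counts,
                            size_t num_classes,
                            const uint32_t *feature_ids, size_t num_active,
                            uint32_t min_updates, double *scores) {
    assert(weights != NULL && update_counts != NULL && scores != NULL);
    for (size_t c = 0; c < num_classes; c++) scores[c] = 0.0;
    for (size_t k = 0; k < num_active; k++) {
        uint32_t f = feature_ids[k];
        if (update_counts[f] < min_updates) continue;  /* not yet trusted */
        for (size_t c = 0; c < num_classes; c++) {
            scores[c] += weights[(size_t)f * num_classes + c];
        }
    }
}
```

Because a rarely updated feature is invisible at scoring time, the model keeps making mistakes it could have fixed with that feature, which in turn drives more updates to it if it is genuinely useful.
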
[parser] right context affixes need to use pre-normalized words as well

a63c182e9659434678291670fc82136026a63562 authored almost 8 years ago by Al <[email protected]>
[parser] fixing some issues in address_parser_features. Prefix/suffix phrases use the word before token-level normalization (but after string-level normalization like lowercasing), needed to use the same string in the feature function as in address_parser_context_fill. Affects some German suffixes like "str." where the final "." would be deleted in token normalization, but the suffix length would include it. Also, three of the new arrays used in address_parser_context (suffix_phrases, prefix_phrases, and sub_tokens) weren't being cleared per call, which means computing the wrong features at best and a segfault at worst

ce9153d94d339fe348aec9ddcd2da9a2007cd8bf authored almost 8 years ago by Al <[email protected]>
[utils] adding aligned malloc/free/realloc in vector.h and matrix.h, fixing bug in matrix_copy

b6bf8da383354d491ff29d2fd4866101ba1a59d8 authored almost 8 years ago by Al <[email protected]>
[parser] using new API in address_parser_test

242b1364aeb4490614606d03ae001108ecfaae9f authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Mayenne, FR

39f59e7ecf7f5634a05f95440fe7c1e595a5db2b authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Hernando County, FL

c2b516c76116aa8f1c59f6b9d06189f30400268b authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding city of Carlsbad, NM

749bb4907eceb17b2be9a88462987c9f142a3709 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding city of Amarillo, TX

154fd42299a47f77d2ad7bdffbdb796d9032ae88 authored almost 8 years ago by Al <[email protected]>
[parser] learning a sparser averaged perceptron model for the parser using the following method:

- store a vector of update counts for each feature in the model
- when the model updates after m...

95015990abbdc37331eedf30aed67adb446233f5 authored almost 8 years ago by Al <[email protected]>
[parser] moving tagger function pointer definition to a separate header so it can be used for other models

5c1c1ae0f2bf56fc77c63ab75eb0d45fa54e2e7c authored almost 8 years ago by Al <[email protected]>
[parser] fix another valgrind error in parser training (cstring_array memory can get moved around when using string pointers obtained before adding to it, which can potentially cause a realloc), no longer using the dummy START tags as the feature function can choose to add features for those cases

cc58ec9db2622f838de6e6779472209c787a2a8c authored almost 8 years ago by Al <[email protected]>
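
The realloc hazard mentioned in the commit above applies to any growable string store: a `char *` taken before an append may dangle afterwards. One way to avoid it, sketched here with a hypothetical `str_buffer_t` rather than the library's `cstring_array`, is to hold offsets and resolve them to pointers only at the point of use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer of NUL-terminated strings stored back to back. */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} str_buffer_t;

/* Appends s and returns its offset, or (size_t)-1 on allocation failure.
 * The buffer may move during this call, so earlier pointers into it are invalid. */
size_t str_buffer_add(str_buffer_t *b, const char *s) {
    size_t n = strlen(s) + 1;
    if (b->len + n > b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 16;
        while (new_cap < b->len + n) new_cap *= 2;
        char *p = realloc(b->data, new_cap);
        if (p == NULL) return (size_t)-1;
        b->data = p;
        b->cap = new_cap;
    }
    size_t offset = b->len;
    memcpy(b->data + offset, s, n);
    b->len += n;
    return offset;
}

/* Resolve an offset to a pointer; valid only until the next add. */
const char *str_buffer_get(const str_buffer_t *b, size_t offset) {
    assert(offset < b->len);
    return b->data + offset;
}
```
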
[parser] moving feature printing to averaged perceptron tagger, taking advantage of trie prefix-sharing in features incorporating previous tags

754f22c79a80132ce14ecbac11c8b6ce890b7545 authored almost 8 years ago by Al <[email protected]>
[parser] fixing affix-related valgrind errors in address parser features

839a13577d8e23d3134e809b3bebb2b318ad1dbc authored almost 8 years ago by Al <[email protected]>
[parser] counting classes instead of keeping a set

c3581557a1f569ce629c59750f8822a64c5a73c1 authored almost 8 years ago by Al <[email protected]>
[fix] trie_new_from_hash

a5283cb3132649cecbb1fa3c82ce003ebf4568b5 authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Hattiesburg, MS

23ed916f09ef47404ae2dec6b7595c1b1c30cb7c authored almost 8 years ago by Al <[email protected]>
[openaddresses] adding Longueuil, QC, Canada

90cb4d904d09dd0773314c2ed4fdbca416d664a5 authored almost 8 years ago by Al <[email protected]>
[utils] tracking keys added in trie construction from hash

5113a1bc3264410222c62e349f24b9be7fd7f8a1 authored almost 8 years ago by Al <[email protected]>
[parser] simpler feature names for the state-transition features

dd4f3eb84c3206bc0dcc78f482226e64e7c3332c authored almost 8 years ago by Al <[email protected]>
[parser] counting num classes in address parser init for models where it is needed a priori

39fa8ff1a5de0b655d24775ab0f158c7697ad612 authored almost 8 years ago by Al <[email protected]>
[parser] more logging in init

5f19e63cbe4982c82ba963372e5636cb4bddc527 authored almost 8 years ago by Al <[email protected]>
[openaddresses] add city of Alexandria, LA

4d2f77b3f34e02ec181ee5d82ff6b7d57dec5656 authored almost 8 years ago by Al <[email protected]>
[parser] adding log message

bb922e4ce44e55c534e34c3b717aa1b081fc5a12 authored almost 8 years ago by Al <[email protected]>
[parser] fixing chunked shuffle, making awk splitting work on Mac

b97de96ab4bf68a339c02b96b8dbe20c1c8103c9 authored almost 8 years ago by Al <[email protected]>
[parser] uint64_t chunk size, no warning if gshuf is available

0e49fc580acb6e984c0247b699844ef8d34f70a9 authored almost 8 years ago by Al <[email protected]>
[openaddresses] add unit phrases in Cape Girardeau, MO

d99f83b84a04c6a45c4c6c418b12d53759b9a1f8 authored almost 8 years ago by Al <[email protected]>