Finish an end-to-end MVP parsing library #36
Comments
So, here's a notebook where I wrote a pretty barebones (but modular!) parser to rip out sections of an ad that contain an address. http://nbviewer.ipython.org/gist/anonymous/4185a30818e315fdd720

Addresses are definitely not entered in a standard way (or if they are, it's agency-by-agency). Sometimes they're formatted as proper mailing addresses (these are the best! They always end with "NY <5-digit number>"), sometimes it'll be "Borough of Brooklyn" instead of "Brooklyn, NY 46992", which is significantly tougher.

First, I messed around with some simple regex that I figured would approximate addresses. Then I found that was a little too inflexible, given how heterogeneous our formats are. The only feature common to all of them was that, at some point, they have a number followed by a space followed by a letter - which was slightly too permissive a standard.

Then, I did a pretty standard NLP move - I found what seemed like the common elements of the addresses in our data set, encoded them as a bunch of regex features, and put them in a list (or a "feature vector", as the kids call it). The code then takes a string, checks how many of the address features are present, and gives that string a score. Right now, given an ad, it checks each sentence and puts it in a special list if it has address features. Interestingly, a sentence with just one address feature contains an address pretty consistently - though I wrote the function with a tweakable threshold for the future.

No proper ML at the moment. Hidden Step 0 of any ML task is to make sure you can't do it in a more reliable, deterministic way. But making feature vectors would definitely be Step 1 toward a parser that learns. At the moment, the workflow is "play around with data of interest, eyeball it, then hand-craft your classifier" - but this is the foundation for making something that can learn.
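For anyone who doesn't want to open the notebook, here is a minimal sketch of the scoring idea described above. It is not the notebook's code: the feature names, the regex patterns, the naive sentence splitter, and the function names are all assumptions made up for illustration.

```python
import re

# Hypothetical address features. The patterns in the notebook may well differ.
ADDRESS_FEATURES = {
    # A number, a space, then a letter (the one trait common to every format).
    "number_then_word": re.compile(r"\b\d+\s+[A-Za-z]"),
    # Proper mailing addresses end with "NY <5-digit number>".
    "state_and_zip": re.compile(r"\bNY\s+\d{5}\b"),
    # A few common street suffixes (far from exhaustive).
    "street_suffix": re.compile(
        r"\b(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Place|Pl)\b",
        re.IGNORECASE,
    ),
    # The "Borough of Brooklyn" style.
    "borough_of": re.compile(
        r"\bBorough of (?:Brooklyn|Queens|Manhattan|the Bronx|Staten Island)\b",
        re.IGNORECASE,
    ),
}


def feature_vector(text):
    """Return a list of 0/1 flags, one per address feature, in dict order."""
    return [1 if pattern.search(text) else 0 for pattern in ADDRESS_FEATURES.values()]


def address_score(text):
    """Count how many address features are present in the text."""
    return sum(feature_vector(text))


def split_sentences(ad_text):
    """Naive splitter: break after ., ! or ? when an uppercase letter follows."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", ad_text)
    return [part.strip() for part in parts if part.strip()]


def address_sentences(ad_text, threshold=1):
    """Keep the sentences whose score meets the (tweakable) threshold."""
    return [s for s in split_sentences(ad_text) if address_score(s) >= threshold]


ad = (
    "Bids will be opened publicly. "
    "Submit bids to 1 Centre Street, New York, NY 10007. "
    "Questions may be sent by email."
)
print(address_sentences(ad))
```

The splitter here would break on abbreviations like "St. John", so a real run over the ads would probably want something sturdier. Raising `threshold` trades recall for precision once the feature list grows.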
@mattalhonte - Good stuff! I like your approach, it will fit nicely into our pipeline ( parser1 | parser2 | parser3 ). Definitely gonna play with this. Can you check if http://regexlib.com/Search.aspx?k=street has any regex patterns we could use? There may be some things we can "borrow" from their effort.
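One way to picture the ( parser1 | parser2 | parser3 ) idea is as a chain of small functions, each taking a record and returning an enriched copy. This is only a sketch under that assumption: the `pipeline` helper, the stage names, and the dict-based record are hypothetical, not the project's actual interface.

```python
import re
from functools import reduce


def pipeline(*stages):
    """Chain stages left to right, like parser1 | parser2 | parser3."""
    def run(record):
        return reduce(lambda acc, stage: stage(acc), stages, record)
    return run


def split_sentences(record):
    """Hypothetical stage 1: add a naive sentence split of the ad text."""
    parts = re.split(r"(?<=[.!?])\s+", record["text"])
    sentences = [part.strip() for part in parts if part.strip()]
    return {**record, "sentences": sentences}


def find_address_candidates(record):
    """Hypothetical stage 2: keep sentences that contain "NY <5-digit number>"."""
    pattern = re.compile(r"\bNY\s+\d{5}\b")
    candidates = [s for s in record["sentences"] if pattern.search(s)]
    return {**record, "address_candidates": candidates}


def count_candidates(record):
    """Hypothetical stage 3: record how many candidates were found."""
    return {**record, "n_candidates": len(record["address_candidates"])}


parse_ad = pipeline(split_sentences, find_address_candidates, count_candidates)

result = parse_ad(
    {"text": "Bids are due Friday. Deliver to 1 Centre Street, New York, NY 10007."}
)
print(result["address_candidates"])
```

Because each stage only reads keys added by earlier stages and returns a new dict, an address parser could be dropped in or swapped out without touching its neighbors.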
@mattalhonte Yes, great stuff! The python book is very clear. Not sure, but there might be some interesting structure we can get from these links as well for addresses: http://cliff.mediameter.org/ & https://github.com/Berico-Technologies
@mattalhonte - usaddress is a python library for parsing unstructured address strings into address components, using advanced NLP methods.
The purpose of this is to give DCAS and DOITT an example of the parsing structure, in order to prepare for later development.