Finish an end-to-end MVP parsing library #36

Open
mikaelmh opened this issue Apr 10, 2015 · 4 comments

Comments

@mikaelmh
Member

The purpose of this is to give DCAS and DOITT an example of the parsing structure, in order to prepare for later development.

@mattalhonte

So, here's a notebook where I wrote a pretty barebones (but modular!) parser to rip out sections of an ad that contain an address. http://nbviewer.ipython.org/gist/anonymous/4185a30818e315fdd720

Addresses are definitely not entered in a standard way (or if they are, it's agency-by-agency). Sometimes they're formatted as proper mailing addresses (these are the best! They always end with "NY <5-digit number>"). Sometimes it'll be "Borough of Brooklyn" instead of "Brooklyn, NY 46992", which is significantly tougher.

First, I messed around with some simple regex that I figured would approximate addresses. That turned out to be a little too inflexible, given how heterogeneous our formats are. The only feature common to all of them was that, at some point, they have a number followed by a space followed by a letter - which was slightly too permissive a standard.
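To make the trade-off concrete, here is a rough sketch of the two extremes described above. The pattern names and sample strings are made up for illustration and are not taken from the notebook.

```python
import re

# Strict: proper mailing addresses end with "NY" and a 5-digit number.
MAILING_END = re.compile(r"\bNY\s+\d{5}\b")
# Permissive: a digit, then a space, then a letter, anywhere in the string.
NUMBER_THEN_LETTER = re.compile(r"\d [A-Za-z]")

# Hypothetical sample strings, not real ads from the data set.
samples = [
    "Send bids to 1 Centre Street, New York, NY 10007",
    "Property located in the Borough of Brooklyn",
    "Bids are due within 30 days of this notice",
]

for s in samples:
    strict = bool(MAILING_END.search(s))
    loose = bool(NUMBER_THEN_LETTER.search(s))
    print(strict, loose, s)
```

The strict pattern misses the "Borough of Brooklyn" style entirely, while the permissive one fires on "30 days", which is not an address at all.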

Then, I did a pretty standard NLP move - I found what seemed like the common elements to the addresses in our data set, then encoded them as a bunch of regex features, then put them in a list (or a "feature vector", as the kids call it). The code then takes a string, checks how many of the address features are present, and gives that string a score.
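A minimal sketch of that scoring idea, assuming a handful of regex features. The specific patterns and the names `ADDRESS_FEATURES`, `feature_vector`, and `address_score` are placeholders for illustration; the notebook's actual feature list may differ.

```python
import re

# Hypothetical address features; each one is a compiled regex.
ADDRESS_FEATURES = [
    re.compile(r"\bNY\s+\d{5}\b"),                      # "NY 10007"
    re.compile(r"\bBorough of\s+\w+", re.I),            # "Borough of Brooklyn"
    re.compile(r"\b\d+\s+(\w+\s+)+(Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b", re.I),
    re.compile(r"\b\d+(st|nd|rd|th)\s+Floor\b", re.I),  # "18th Floor"
]

def feature_vector(text):
    """Return one boolean per feature: does it match anywhere in text?"""
    return [bool(f.search(text)) for f in ADDRESS_FEATURES]

def address_score(text):
    """Score a string by the number of address features present."""
    return sum(feature_vector(text))

print(address_score("1 Centre Street, 18th Floor, New York, NY 10007"))
print(address_score("Borough of Brooklyn"))
```

Keeping the features in a plain list means adding or dropping one is a one-line change, and the same vector could later feed a learned classifier instead of a simple sum.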

Right now, given an ad, it checks each sentence and puts it in a special list if it has address features. Interestingly, a sentence with just one address feature contains an address pretty consistently - though I wrote the function with a tweakable threshold for the future.
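Roughly, the per-sentence filter with a tweakable threshold could look like the sketch below. The two features, the naive sentence splitter, and the function name `address_sentences` are assumptions for illustration, not the notebook's code.

```python
import re

# Two hypothetical address features, kept short for the example.
FEATURES = [
    re.compile(r"\bNY\s+\d{5}\b"),
    re.compile(r"\bBorough of\s+\w+", re.I),
]

def address_sentences(ad_text, threshold=1):
    """Return the sentences of an ad that match at least `threshold` features."""
    # Naive split on sentence-ending punctuation; abbreviations such as
    # "St." would be split incorrectly and need a better tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", ad_text.strip())
    return [
        s for s in sentences
        if sum(bool(f.search(s)) for f in FEATURES) >= threshold
    ]

ad = (
    "Sealed bids will be received until 11:00 AM. "
    "Deliver them to 1 Centre Street, New York, NY 10007. "
    "Late bids will be rejected."
)
print(address_sentences(ad))
```

Raising `threshold` makes the filter stricter, which may help if single-feature matches start producing false positives on other agencies' ads.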

No proper ML at the moment. Hidden Step 0 to any ML task is to make sure you can't do it in a more reliable, deterministic way. But, making feature vectors would definitely be step 1 to making a parser that learns. At the moment, the workflow is "play around with data of interest, eyeball it, then hand-craft your classifier" - but this is the foundation of making something that can learn.

@cds-amal
Contributor

@mattalhonte - Good stuff! I like your approach, it will fit nicely into our pipeline ( parser1 | parser2 | parser3 ). Definitely gonna play with this. Can you check if http://regexlib.com/Search.aspx?k=street has any regex patterns we could use? There may be some things we can "borrow" from their effort.
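One way to read the `parser1 | parser2 | parser3` notation is as a chain of functions that each take a record and return an enriched copy. That reading is an assumption, as the thread does not show how the pipeline is actually wired up, and the parser names and record layout below are hypothetical.

```python
import re
from functools import reduce

def extract_zip(record):
    """Hypothetical parser: add the 5-digit number following "NY", if any."""
    m = re.search(r"\bNY\s+(\d{5})\b", record["text"])
    return {**record, "zip": m.group(1) if m else None}

def extract_borough(record):
    """Hypothetical parser: add the first borough name found, if any."""
    m = re.search(r"\b(Brooklyn|Queens|Manhattan|Bronx|Staten Island)\b", record["text"])
    return {**record, "borough": m.group(1) if m else None}

def run_pipeline(record, parsers):
    """Feed the record through each parser in order, like a shell pipe."""
    return reduce(lambda rec, parser: parser(rec), parsers, record)

result = run_pipeline(
    {"text": "Office at 210 Joralemon Street, Brooklyn, NY 11201"},
    [extract_zip, extract_borough],
)
print(result)
```

Because every stage has the same record-in, record-out shape, an address-sentence parser could be dropped in as one more entry in the list.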

@mikaelmh
Member Author

@mattalhonte Yes, great stuff! The Python notebook is very clear. Not sure, but there might be some interesting structure we can get from these links as well for addresses: http://cliff.mediameter.org/ & https://github.com/Berico-Technologies

@cds-amal
Contributor

@mattalhonte - usaddress is a python library for parsing unstructured address strings into address components, using advanced NLP methods.
