-
-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove variant names from index #529
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Paris
Pantruche
Ville-Lumière
Paname
Lutèce
I'm quite amazed, Google does return Paris when you search Ville-Lumière
or Paname
, maybe an easter egg 😅
I'm 👍 for this, just interested to see if it causes any acceptance test failures before merging. |
If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit. I don't remember the history of this but it's likely the convention was copied across to other parts of the codebase. |
I thought things like It's seems 'variant' is where toponyms go to die 😆 |
Agreed, I was just thinking about that. The impact of all those extra names is probably even higher for Placeholder since it considers matches across the entire parent hierarchy. |
Who's on First variant names are a useful collection of unofficial names for places, but they tend to be pretty messy. This PR explores the effect of removing them from indexing. While there might be occasionally useful names in there, it seems like the majority are exact or near duplicates of more official names, or names that are so colloquial that they are not particularly useful (do we _really_ need to support returning NYC for queries for "the big apple"?). For reference, here are some variant names for some key places, just to record the kind of data that's in there: NYC: ``` Bigapple NY City NY Cty New York City New York Cty Newyork Newyorkcity Novaiorque Nycity Thebigapple Big Apple ``` San Francisco: ``` S Francisco S. Francisco SFO Sanfran Sanfrancisco Frisco ``` China: ``` China - Peoples Republic China Peoples Rep China, People's Republic Chinese PR China PR of China People's Republic of China Peoples Republic of China ```
32d0ede
to
690919e
Compare
Okay, the results from this are in and they look pretty good. There is a decent increase to the overall score of our autocomplete acceptance tests, and a big increase in some other test cases like As far as I can see there are almost no significant regressions from this change. A few individual autocomplete characters here and there, but nothing that looks like a trend. If I had to summarize, overall it looks like removing variant names has three main positive effects
Here's some cases that show off one or more of these. New York, New YorkThis is a query that has often been tough to get right since there are several results we want near the top, and lots of chances for duplicates or undesirable records to sneak in.
The test says that both New York city and county should appear in the results. I'd argue We should add New York State to that list. But in any case the removal of variant names mean that the desired results for the WOF city and county records score higher than before. This boosts them above the I think this also shows that when we really fix our scoring in pelias/pelias#862, we'll see even more and better results like this. MissouriA common trend in city and state results is fixing issues where a record simply wasn't ever displayed because it would be deduplicated. For example, the state of Missouri would essentially never come up in results because it would be deduped with Missouri Township, MO, which has Our deduplication code currently prefers more granular results (for example, There's still some deduplication related issues here that we should look at, many of them can be fixed with data updates. Various cities now showing up earlier in autocompleteIndividually these are all not necessarily amazing changes, but I noticed a decent trend of cities showing up one or two characters earlier in results. When we're talking queries that are only 2-4 characters, that's actually a big deal! Final notesI was expecting a big of a decrease in index size for this change, since there are a reasonable number of variant names out there. But it turned out to only be about 5MB. I suppose there might be a slight performance increase because fewer documents will match any given query, but I'm expecting it to not be something we can notice. We should just be able to go by the various improvements and feel confident merging this :) |
These pass as a result of removing variant names from WOF in pelias/whosonfirst#529
Who's on First variant names are a useful collection of unofficial names for places, but they tend to be pretty messy. This PR explores the effect of removing them from indexing.
While there might be occasionally useful names in there, it seems like the majority are exact or near duplicates of more official names, or names that are so colloquial that they are not particularly useful (do we really need to support returning NYC for queries for "the big apple"?).
Here are some variant names for some key places, just to record the kind of data that's in there:
NYC:
San Francisco:
China: