-
-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parameterised fuzzy autocomplete #1510
base: master
Are you sure you want to change the base?
Parameterised fuzzy autocomplete #1510
Conversation
Hi @andreibalaban, I tested this out real quick and it does work pretty well overall, right? However, as you might have discovered, adding I'm not going to go into a ton of detail here, because we honestly don't know all the answers ourselves, but here are some considerations for merging fuzzy matching: API ParametersWe're generally pretty hesitant to add new parameters to the Pelias API. Once they're in, they can't be taken out, and parameters that affect performance negatively require special scrutiny. Lots of Pelias users are using it with minimal to no intermediate layer between them and end users, so to some degree the Pelias API itself needs to be "user-proof". Some things we might want to do here
Extra non-matching resultsThe cool part of fuzzy matching is that it lets you return the intended result with typos. The bad part is it potentially makes you return lots of unintended results even if there are no typos. Consider a query for Now it returns a bunch of other addresses on the same street This is tricky to even measure: our current acceptance-test framework would still mark this as passing, since the correct result is indeed returned first. However, it's a bit of a regression from a user experience standpoint. At least a few brain cells have to be used to make sure the first result is indeed the right one 🧠. We've talked about adding a "post-processing" step that somehow looks at all the results and prunes out options that are clearly bad, but that's a difficult problem. Another potential solution is to run an Elasticsearch query without fuzziness first, and then only run another with fuzziness if there were no results returned. That has its own downsides. Complex query clausesThe autocomplete endpoint is pretty complicated under the hood as you now well know. I see you had to make a couple changes to the queries themselves to enable fuzzy matching, and we have to be careful with those. Also, it looks like not every single query clause has fuzzy matching enabled, which leads to things like This isn't your fault, we really need to do some cleanup to the autocomplete endpoint, hopefully making it simpler to work with, and probably improving the results along the way. This is really hard work, and there's probably a happy balance where fuzzy matching works in most cases, but there's still more work to be done. Performance and response timeStating another obvious thing, autocomplete has to return results really fast 💨. While adding fuzzy matching capability is definitely worth some response time, we'd want to test this out quite a bit to make sure its worth it. In our experience synthetic loadtesting can help, but the real answer comes from testing live traffic at scale. That's an extra challenge for something like this that changes the responses and parameters, but it's not insurmountable. You probably have some capabilities to do this, as we certainly do. Next stepsSo, I'm really happy to see this and push it forward over the next few months. We've been dragging our feet on it a bit, since it's going to be a challenge, but working with others who are motivated helps there 🧑🤝🧑. Feel free to follow up here with more thoughts and comments. Also we should probably set up some sort of call to figure out a path forward. [email protected] is always open :) Thanks again! |
Hello @orangejulius, I'm really glad that our initiative for integrating I wrote on '[email protected]' to schedule a call with you guys. Cheers ! |
Hello @orangejulius, It was great meeting you last week and we really appreciated the exchange of ideas. Thank you! |
@orangejulius any way we can move forward with this? Would be nice to merge the logic upstream. |
Hi @andreibalaban and @mihneadb, Sorry for the delay here! I finally got around to really taking a look at the impact of this PR by setting up a branch with some reasonable default values for the fuzzy search parameters and running it through our acceptance test suite. While there are a lot of cases where fuzzy autocomplete helps (obviously), I think there will be more tuning work to do before we can merge this into Pelias. Impact on non-typo queriesThe main issue I saw is that there are a decent number of cases where the "correct" results for a query with no typos no longer comes up first. For example, take this query for Paris: Pakistan coming up first is a bit of a dealbreaker here. A rough estimate from all our acceptance tests is that this happens around 3-5% of the time. By no means do we need to wait until there are no regressions before merging fuzzy autocomplete, but from my non-scientific testing we aren't there yet. Postal codesPostal codes are particularly impacted. Since most numeric substitutions for any valid postalcode will also result in a valid postalcode, and all postal codes will generally have the same score from Elasticsearch, the odds of the "right" one coming up first as it stands now is low. In this case the expected postalcode comes up 10th, so it barely even makes it into the results at all. Next stepsI think the path forward here probably entails a bit of work: we might want to add logic to detect cases (like all numeric inputs for postalcodes), where we don't do any fuzzy matching. We also might have to add new query clauses that give a slight boost to exact matches, to solve issues like the Paris query above. If you have any other ideas or have techniques that have worked for you, let me know. This is an important feature so we do plan to keep working on it. It's not the main thing we're working on today, but we are definitely not going to stop working on it just because it's hard :) |
Oops, quick follow up here. I just realized that this PR is supposed to include logic for boosting exact matches. As outlined in my examples above, that's definitely important. I'm not 100% sure my branch to test fuzzy autocomplete is correctly enabling that new logic though. Let me know if it looks broken to you. It might also be that the new exact match logic isn't quite working right, in which case by fixing it we can improve result quality quite a bit. Also let me know if you need any help running our acceptance tests to do your own testing. Both the regular acceptance tests and some of the test suites from our fuzzy tests, in particular the city search suite are useful here. It can be a bit of an art to interpret the acceptance test results though, so let us know if we need to lend a hand. |
Hello @orangejulius, Thank you for continuing work on this and sharing insight about the problems that you discovered. I ran a quick test using your branch (but using our data) and I managed to reproduce the postal code issue but not the one where Pakistan appears first while looking for Paris. That being said we are willing to make an effort to make this work for a more "general-purpose" use-case, hopefully in a few weeks. Maybe what needs to be done is to invest more into reformulating the "exact-match" clause (which right now I think takes only complete tokens into account, and could be improved) and parametrise somehow the balance between fuzzy+popularity driven matches and exact match oriented results (maybe with a default on exact match oriented results to suit a general purpose use). Thank you! |
…uery parts from the "must" clause that were "fuzzified" are now part of the "should" clause, in their original (exact match) form.
Hello @orangejulius, I reformulated the "exact match boost" clauses in my latest commit Thank you! |
👋 I did some awesome work for the Pelias project and would love for everyone to have a look at it and provide feedback.
Here's the reason for this change 🚀
Working with
pelias-api
I discovered that a significant source of "no results" for theautocomplete
endpoint are typos.A "fuzzy" autocomplete would addresses this, is supported by current elasticsearch queries and I also saw that this idea was previously explored in this PR : (#1268) which was my initial source of inspiration.
The changes presented here support fuzziness on the current form of the autocomplete query, which includes multi-language support and a slightly different body than at the time of cited PR.
Also,
fuzziness
andmax_expansions
are exposed as parameters at request time - when they are not given, the autocomplete queries are the same as before.Here's what actually got changed 👏
Here's how others can test the changes 👀
test/unit/query/autocomplete_fuzzy.js
) and fixtures are written for the fuzzy form of the queries