-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rank emoji results in autocomplete #1112
Conversation
OK, did that too. The result was discovering #1113, fixed in #1114; and some other differences. Quoting a comment added in #1114: // Behavior differences we might copy or change, TODO:
// * Web has a particular ordering of Unicode emoji;
// a data file groups them by category and orders within each of those,
// and the code has a list of categories.
// This seems useful; it'll call for expanding the server emoji data API.
// * Both here and in web, the realm emoji appear in whatever order the
// server returned them in; and that order appears to be random,
// presumably the iteration order of some Python dict,
// and to vary over time.
//
// Behavior differences that web should probably fix, TODO(web):
// * Web ranks the realm's custom emoji (and the Zulip extra emoji) at the
// end of the base list, as seen in the emoji picker on an empty query;
// but then ranks them first, after only the six "popular" emoji,
// once there's a non-empty query.
// * Web gives the six "popular" emoji a set order amongst themselves,
// like we do after #1112; but in web, this order appears only in the
// emoji picker on an empty query, and is otherwise lost even when the
// emoji are taken out of their home categories and shown instead
// together at the front.
//
// In web on an empty query, :+1: aka :like: comes first, and
// :heart: aka :love: comes later (fourth); but then on the query "l",
// the results begin with :love: and then :like:. They've flipped order,
// even though they're equally good prefix matches to the query. |
We'll soon (for zulip#1068) be adding logic that distinguishes these emoji from other Unicode emoji. That would break some test cases which refer to an emoji that happens to be "popular", like 😄, when they really just intend the generic behavior that happens to any Unicode emoji.
These searches for 'h' and 's' would have many results in a real emoji database; in particular they'd have multiple results even in an emoji database that has just the six "popular" emoji. We'll soon make a change that causes those "popular" emoji to always be present. To prepare these tests, make the queries more specific.
This is NFC except for performance. This is an extremely common case -- it happens every time the user opens the emoji picker, and applies to every emoji -- so worth optimizing.
This is NFC except for performance: it saves us from having to construct this string again for each emoji name. At this stage there's a (much smaller) performance loss, too: we now compute this string even when we'll never actually use it because the query doesn't contain the separator. But we'll soon start checking for word-aligned matches even for those queries where we'd accept a non-word-aligned match; once we do, we'll need this string for all nonempty queries, for almost all names.
Specifically the sort takes linear time so long as there aren't more than linearly-many buckets. For our immediate use case of ranking emoji-autocomplete results, we'll in fact stick to a constant number of buckets.
This logic will grow to handle ranking, and it's closely tied to the matching logic that lives on the query.
…y private In the app, the only entry point to this logic should be through EmojiAutocompleteView. The method is public only because that made it more convenient to write thorough unit tests.
We'll need this as we add more logic here.
This makes these tests a bit easier to read already, with less noisy repetition; and it prepares them for adding more complexity to how this matching works.
This actually has a pretty nice effect on the readability of these tests, even at this stage where the enum isn't doing anything! Separate from the parent commit just because this one is a bigger, and almost entirely mechanical and boring, diff.
Fixes part of zulip#1068.
As a bonus, this provides the popular emoji as candidates even when we haven't yet fetched the server's emoji data.
b14ee1c
to
7045afe
Compare
Rebased following the merge of #1114. Resolved conflicts, fixed some of that PR's tests affected by this one, and updated some comments. |
Thanks! LGTM, please merge at will. I notice we don't give any results on CZO for these queries:
even though the emojis "yo_yo" and "thank_you" exist. Maybe not ideal 🤷♂️ but not a priority since this is web's behavior too. |
Thanks for the review! Merged. Yeah, it might be reasonable to have a query like "thankyou" match the name (The "yoyo" example has perhaps a confounding factor that the actual word is often spelled like that, as well as "yo-yo". But the solution to that would be to have an emoji named In any case, that'd be a change to make across both web and mobile. |
Fixes #1068.
As is, this PR applies to the results of autocomplete in the compose box (when you type a colon
:
). In combination with #1103 adding an emoji picker, it applies to the results in the emoji picker too, because that PR reuses the same autocomplete machinery.This mimics the ranking in the web client (from shared/src/typeahead.ts) with a handful of small discrepancies. Almost all are (minor) bugs in web:
(There may also be differences in the base ordering within custom emoji or within Unicode emoji, seen when they match a given query equally well; I haven't yet studied those to pin them down.)
The details of this implementation work a bit differently from those in web. I think this one is easier to follow, and it's certainly more efficient. Rather than iterate through the list of candidates again and again to evaluate various criteria, partitioning into sub-lists and sub-sub-lists, for each query we
Similarly, rather than check for different kinds of matches once to decide if the emoji matches the query at all, and separately later to help decide how to rank it among matching results, we check for matches just once and return an EmojiMatchQuality enum rather than just a boolean, which records the information we need for ranking too.
I think the duplication of logic within the web implementation, in each of those areas, is the root cause of the bugs listed above — they're discrepancies between one part of web's logic and another, as tends to naturally arise once the same logic is repeated in multiple places in different ways.
Selected commit messages
algorithms: Add bucketSort, for stable, linear-time sorting
Specifically the sort takes linear time so long as there aren't
more than linearly-many buckets. For our immediate use case of
ranking emoji-autocomplete results, we'll in fact stick to a
constant number of buckets.
emoji [nfc]: Add ranking framework for emoji autocomplete results
emoji test [nfc]: Make the EmojiMatchQuality values explicit
This actually has a pretty nice effect on the readability of these
tests, even at this stage where the enum isn't doing anything!
Separate from the parent commit just because this one is a
bigger, and almost entirely mechanical and boring, diff.
emoji: Rank by quality of match (exact, prefix, other)
Fixes part of #1068.
emoji: Add list of the "popular" emoji
emoji: Rank "popular" > custom > other emoji
Fixes part of #1068.
emoji: Recognize word-aligned matches in ranking
Fixes #1068.
emoji: Order "popular" emoji canonically amongst themselves
As a bonus, this provides the popular emoji as candidates
even when we haven't yet fetched the server's emoji data.