Rank emoji results in autocomplete #1112

gnprice · 2024-12-08T23:15:37Z

As is, this PR applies to the results of autocomplete in the compose box (when you type a colon :). In combination with #1103 adding an emoji picker, it applies to the results in the emoji picker too, because that PR reuses the same autocomplete machinery.

This mimics the ranking in the web client (from shared/src/typeahead.ts) with a handful of small discrepancies. Almost all are (minor) bugs in web:

    // Behavior differences we might copy, TODO:
    //  * Web ranks each name of a Unicode emoji separately.
    //  * Web recognizes a word-aligned match starting after [ /-] as well as [_].
    //
    // Behavior differences that web should probably fix, TODO(web):
    //  * Among popular emoji with non-exact matches,
    //    web doesn't prioritize prefix over word-aligned; we do.
    //    (This affects just one case: for query "o",
    //    we put :octopus: before :working_on_it:.)
    //  * Web only counts an emoji as "popular" for ranking if the query
    //    is a prefix of a single word in the name; so "thumbs_" or "working_on_i"
    //    lose the ranking boost for :thumbs_up: and :working_on_it: respectively.
    //  * Web starts with only case-sensitive exact matches ("perfect matches"),
    //    and puts case-insensitive exact matches just ahead of prefix matches;
    //    it also distinguishes prefix matches by case-sensitive vs. not.
    //    We use case-insensitive matches throughout;
    //    case seems unhelpful for emoji search.
    //  * Web suppresses Unicode emoji names shadowed by a realm emoji
    //    only if the latter is also a match for the query.  That mostly works,
    //    because emoji with the same name will mostly both match or both not;
    //    but it breaks if the Unicode emoji was a literal match.
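
To illustrate the last item, here is a rough Python sketch (not the app's own code) of suppressing shadowed names up front, without consulting the query. The function name `visible_unicode_names` and the data shapes are hypothetical, chosen only for this example.

```python
def visible_unicode_names(
    unicode_names: dict[str, list[str]],
    realm_names: set[str],
) -> dict[str, list[str]]:
    """Drop Unicode emoji names that a realm emoji shadows.

    `unicode_names` maps an emoji code to its names (a hypothetical shape).
    The filtering never looks at the query, so a shadowed name cannot
    resurface depending on what the user typed.
    A Unicode emoji left with no names is omitted entirely.
    """
    result = {}
    for code, names in unicode_names.items():
        kept = [name for name in names if name not in realm_names]
        if kept:
            result[code] = kept
    return result
```

Because the filter runs before any matching, it behaves the same whether or not the realm emoji matches the query.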

(There may also be differences in the base ordering within custom emoji or within Unicode emoji, seen when they match a given query equally well; I haven't yet studied those to pin them down.)

The details of this implementation work a bit differently from those in web. I think this one is easier to follow, and it's certainly more efficient. Rather than iterate through the list of candidates again and again to evaluate various criteria, partitioning into sub-lists and sub-sub-lists, for each query we

 * give each result a single ranking score once, based on all the criteria;
 * then sort by rank, using a stable sort.
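
A minimal sketch of that shape, in Python rather than the app's own language. The names `rank_results` and `score` are hypothetical, and the convention that `None` means "no match" is an assumption made for this example.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def rank_results(
    candidates: Sequence[T],
    score: Callable[[T], Optional[int]],
) -> list[T]:
    """Score each candidate once, drop non-matches, and sort by score.

    `score` returns None for a candidate that doesn't match the query,
    and otherwise a rank where lower means better.
    """
    scored = []
    for candidate in candidates:
        rank = score(candidate)  # evaluated exactly once per candidate
        if rank is not None:
            scored.append((rank, candidate))
    # list.sort() is stable, so candidates with equal rank keep their base order.
    scored.sort(key=lambda pair: pair[0])
    return [candidate for _, candidate in scored]
```

The stable sort is what lets the base ordering of the candidate list act as the tie-breaker among equally good matches.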

Similarly, rather than check for different kinds of matches once to decide if the emoji matches the query at all, and separately later to help decide how to rank it among matching results, we check for matches just once. That check returns an EmojiMatchQuality enum rather than just a boolean, so it also records the information we need for ranking.
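
As an illustration of a single check that both filters and ranks, here is a hedged Python sketch. The PR's enum is EmojiMatchQuality; the member names below are guesses based on the commit titles (exact, prefix, word-aligned, other), the function `match_quality` is hypothetical, and only "_" is treated as a word separator.

```python
from enum import IntEnum
from typing import Optional


class MatchQuality(IntEnum):
    """How well a query matches an emoji name; lower is better."""
    EXACT = 0
    PREFIX = 1
    WORD_ALIGNED = 2
    OTHER = 3


def match_quality(query: str, name: str) -> Optional[MatchQuality]:
    """Classify the match in one pass; None means no match at all.

    Assumes a non-empty query.
    """
    query, name = query.lower(), name.lower()  # case-insensitive throughout
    if name == query:
        return MatchQuality.EXACT
    if name.startswith(query):
        return MatchQuality.PREFIX
    if ("_" + query) in name:
        return MatchQuality.WORD_ALIGNED  # match begins right after a "_"
    if query in name:
        return MatchQuality.OTHER
    return None
```

A caller can treat `None` as "exclude this emoji" and feed any other value straight into the ranking, so the matching rules live in one place.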

I think the duplication of logic within the web implementation, in each of those areas, is the root cause of the bugs listed above — they're discrepancies between one part of web's logic and another, as tends to naturally arise once the same logic is repeated in multiple places in different ways.

Selected commit messages

algorithms: Add bucketSort, for stable, linear-time sorting

Specifically the sort takes linear time so long as there aren't
more than linearly-many buckets. For our immediate use case of
ranking emoji-autocomplete results, we'll in fact stick to a
constant number of buckets.
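
For readers unfamiliar with the technique, a bucket sort along these lines can be sketched as follows. This is Python, not the code added in the commit, and the name and signature of `bucket_sort` are assumptions for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def bucket_sort(
    items: Sequence[T],
    bucket_of: Callable[[T], int],
    num_buckets: int,
) -> list[T]:
    """Stable sort of `items` by bucket index, in O(len(items) + num_buckets).

    `bucket_of` must return an integer in range(num_buckets).
    """
    buckets: list[list[T]] = [[] for _ in range(num_buckets)]
    for item in items:
        index = bucket_of(item)
        if not 0 <= index < num_buckets:
            raise ValueError(f"bucket index {index} out of range")
        buckets[index].append(item)  # appending preserves input order
    return [item for bucket in buckets for item in bucket]
```

With a constant number of buckets, the `num_buckets` term is a constant and the whole sort is linear in the number of items.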

emoji [nfc]: Add ranking framework for emoji autocomplete results

emoji test [nfc]: Make the EmojiMatchQuality values explicit

This actually has a pretty nice effect on the readability of these
tests, even at this stage where the enum isn't doing anything!

Separate from the parent commit just because this one is a
bigger, and almost entirely mechanical and boring, diff.

emoji: Rank by quality of match (exact, prefix, other)

Fixes part of #1068.

emoji: Add list of the "popular" emoji

emoji: Rank "popular" > custom > other emoji

Fixes part of #1068.
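
One way to fold the kind of emoji and the quality of match into a single bucket index is sketched below in Python. The names `Kind`, `Quality`, and `rank` are hypothetical, and the layout (exact matches of any kind first, then kind by kind, with quality breaking ties within a kind) is an assumption for illustration rather than the layout this commit uses.

```python
from enum import IntEnum


class Kind(IntEnum):
    POPULAR = 0
    CUSTOM = 1
    OTHER = 2


class Quality(IntEnum):
    EXACT = 0
    PREFIX = 1
    OTHER = 2


def rank(kind: Kind, quality: Quality) -> int:
    """Return a bucket index in range(7); lower sorts first.

    Assumed layout: bucket 0 holds exact matches of any kind; the
    remaining buckets go kind by kind, and by quality within a kind.
    """
    if quality is Quality.EXACT:
        return 0
    non_exact = len(Quality) - 1  # qualities other than EXACT
    return 1 + kind * non_exact + (quality - 1)
```

Since the result is a small integer from a fixed range, it can serve directly as the bucket index for a bucket sort with a constant number of buckets.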

emoji: Recognize word-aligned matches in ranking

Fixes #1068.

emoji: Order "popular" emoji canonically amongst themselves

As a bonus, this provides the popular emoji as candidates
even when we haven't yet fetched the server's emoji data.

gnprice · 2024-12-09T01:53:23Z

> (There may also be differences in the base ordering within custom emoji or within Unicode emoji, seen when they match a given query equally well; I haven't yet studied those to pin them down.)

OK, did that too. The result was discovering #1113, fixed in #1114; and some other differences. Quoting a comment added in #1114:

    // Behavior differences we might copy or change, TODO:
    //  * Web has a particular ordering of Unicode emoji;
    //    a data file groups them by category and orders within each of those,
    //    and the code has a list of categories.
    //    This seems useful; it'll call for expanding the server emoji data API.
    //  * Both here and in web, the realm emoji appear in whatever order the
    //    server returned them in; and that order appears to be random,
    //    presumably the iteration order of some Python dict,
    //    and to vary over time.
    //
    // Behavior differences that web should probably fix, TODO(web):
    //  * Web ranks the realm's custom emoji (and the Zulip extra emoji) at the
    //    end of the base list, as seen in the emoji picker on an empty query;
    //    but then ranks them first, after only the six "popular" emoji,
    //    once there's a non-empty query.
    //  * Web gives the six "popular" emoji a set order amongst themselves,
    //    like we do after #1112; but in web, this order appears only in the
    //    emoji picker on an empty query, and is otherwise lost even when the
    //    emoji are taken out of their home categories and shown instead
    //    together at the front.
    //
    //    In web on an empty query, :+1: aka :like: comes first, and
    //    :heart: aka :love: comes later (fourth); but then on the query "l",
    //    the results begin with :love: and then :like:.  They've flipped order,
    //    even though they're equally good prefix matches to the query.

We'll soon (for zulip#1068) be adding logic that distinguishes these emoji from other Unicode emoji. That would break some test cases which refer to an emoji that happens to be "popular", like 😄, when they really just intend the generic behavior that happens to any Unicode emoji.

These searches for 'h' and 's' would have many results in a real emoji database; in particular they'd have multiple results even in an emoji database that has just the six "popular" emoji. We'll soon make a change that causes those "popular" emoji to always be present. To prepare these tests, make the queries more specific.

This is NFC except for performance: it saves us from having to construct this string again for each emoji name. At this stage there's a (much smaller) performance loss, too: we now compute this string even when we'll never actually use it because the query doesn't contain the separator. But we'll soon start checking for word-aligned matches even for those queries where we'd accept a non-word-aligned match; once we do, we'll need this string for all nonempty queries, for almost all names.

gnprice · 2024-12-09T21:21:52Z

Rebased following the merge of #1114. Resolved conflicts, fixed some of that PR's tests affected by this one, and updated some comments.

chrisbobbe · 2024-12-09T21:40:59Z

Thanks! LGTM, please merge at will.

I notice we don't give any results on CZO for these queries:

yoyo
yo-yo
thankyou
thank-you

even though the emojis "yo_yo" and "thank_you" exist. Maybe not ideal 🤷‍♂️ but not a priority since this is web's behavior too.

gnprice · 2024-12-09T21:49:03Z

Thanks for the review! Merged.

Yeah, it might be reasonable to have a query like "thankyou" match the name thank_you. I think it's pretty natural to type the space, though: "thank you". And I feel like I usually expect a search UI to require separate words in the query in order to match separate words in the text.

(The "yoyo" example has perhaps a confounding factor that the actual word is often spelled like that, as well as "yo-yo". But the solution to that would be to have an emoji named yoyo.)

In any case, that'd be a change to make across both web and mobile.
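
The behavior discussed above can be sketched in Python as follows. It assumes the query is lowercased and its spaces are mapped to the "_" separator before matching, which the discussion implies but does not spell out; `normalize_query` and `matches` are hypothetical names.

```python
def normalize_query(raw: str) -> str:
    """Lowercase the query and treat spaces as the "_" word separator."""
    return raw.lower().replace(" ", "_")


def matches(raw_query: str, name: str) -> bool:
    """Whether the normalized query occurs anywhere in the emoji name."""
    return normalize_query(raw_query) in name.lower()
```

Under these assumptions "thank you" matches thank_you, while "thankyou" and "thank-you" do not, because neither contains the separator that the name does.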

gnprice added the "maintainer review" label (PR ready for review by Zulip maintainers) Dec 8, 2024

gnprice requested a review from chrisbobbe December 8, 2024 23:15

This was referenced Dec 8, 2024:

Rank emoji autocomplete results #1068 (Closed)
Support adding arbitrary reactions #1103 (Merged)
Deactivated custom emoji offered in autocomplete #1113 (Closed)
Exclude deactivated realm emoji from autocomplete #1114 (Merged)

gnprice added 18 commits December 9, 2024 13:06

emoji: Short-circuit matching an empty query

ab50a5d

This is NFC except for performance. This is an extremely common case -- it happens every time the user opens the emoji picker, and applies to every emoji -- so worth optimizing.

algorithms: Add bucketSort, for stable, linear-time sorting

b67e172

Specifically the sort takes linear time so long as there aren't more than linearly-many buckets. For our immediate use case of ranking emoji-autocomplete results, we'll in fact stick to a constant number of buckets.

emoji [nfc]: Move _testCandidate logic onto query class

eb26c84

This logic will grow to handle ranking, and it's closely tied to the matching logic that lives on the query.

emoji [nfc]: Mark query-matches method visibleForTesting, conceptually private

3cd8825

In the app, the only entry point to this logic should be through EmojiAutocompleteView. The method is public only because that made it more convenient to write thorough unit tests.

emoji [nfc]: Make testCandidate visible for testing

a077e9a

We'll need this as we add more logic here.

emoji test [nfc]: Pull matchesNames up to wider scope

09813b2

emoji test [nfc]: Tighten query-matches tests

5cc5142

This makes these tests a bit easier to read already, with less noisy repetition; and it prepares them for adding more complexity to how this matching works.

emoji [nfc]: Add ranking framework for emoji autocomplete results

70e132b

emoji: Rank by quality of match (exact, prefix, other)

37652db

Fixes part of zulip#1068.

emoji test [nfc]: Extract helpers realmCandidate, zulipCandidate

70004b0

emoji: Add list of the "popular" emoji

6c93b40

emoji: Rank "popular" > custom > other emoji

bb8935a

Fixes part of zulip#1068.

emoji: Recognize word-aligned matches in ranking

a885520

Fixes zulip#1068.

emoji: Order "popular" emoji canonically amongst themselves

7045afe

As a bonus, this provides the popular emoji as candidates even when we haven't yet fetched the server's emoji data.

gnprice force-pushed the pr-emoji-ranking branch from b14ee1c to 7045afe December 9, 2024 21:19

chrisbobbe self-assigned this Dec 9, 2024

gnprice merged commit 7045afe into zulip:main Dec 9, 2024
1 check passed

gnprice deleted the pr-emoji-ranking branch December 9, 2024 21:49
