Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rank emoji results in autocomplete #1112

Merged
merged 18 commits into from
Dec 9, 2024
Merged

Conversation

gnprice
Copy link
Member

@gnprice gnprice commented Dec 8, 2024

Fixes #1068.

As is, this PR applies to the results of autocomplete in the compose box (when you type a colon :). In combination with #1103 adding an emoji picker, it applies to the results in the emoji picker too, because that PR reuses the same autocomplete machinery.

This mimics the ranking in the web client (from shared/src/typeahead.ts) with a handful of small discrepancies. Almost all are (minor) bugs in web:

    // Behavior differences we might copy, TODO:
    //  * Web ranks each name of a Unicode emoji separately.
    //  * Web recognizes a word-aligned match starting after [ /-] as well as [_].
    //
    // Behavior differences that web should probably fix, TODO(web):
    //  * Among popular emoji with non-exact matches,
    //    web doesn't prioritize prefix over word-aligned; we do.
    //    (This affects just one case: for query "o",
    //    we put :octopus: before :working_on_it:.)
    //  * Web only counts an emoji as "popular" for ranking if the query
    //    is a prefix of a single word in the name; so "thumbs_" or "working_on_i"
    //    lose the ranking boost for :thumbs_up: and :working_on_it: respectively.
    //  * Web starts with only case-sensitive exact matches ("perfect matches"),
    //    and puts case-insensitive exact matches just ahead of prefix matches;
    //    it also distinguishes prefix matches by case-sensitive vs. not.
    //    We use case-insensitive matches throughout;
    //    case seems unhelpful for emoji search.
    //  * Web suppresses Unicode emoji names shadowed by a realm emoji
    //    only if the latter is also a match for the query.  That mostly works,
    //    because emoji with the same name will mostly both match or both not;
    //    but it breaks if the Unicode emoji was a literal match.

(There may also be differences in the base ordering within custom emoji or within Unicode emoji, seen when they match a given query equally well; I haven't yet studied those to pin them down.)

The details of this implementation work a bit differently from those in web. I think this one is easier to follow, and it's certainly more efficient. Rather than iterate through the list of candidates again and again to evaluate various criteria, partitioning into sub-lists and sub-sub-lists, for each query we

  • give each result a single ranking score once, based on all the criteria;
  • then sort by rank, using a stable sort.

Similarly, rather than check for different kinds of matches once to decide if the emoji matches the query at all, and separately later to help decide how to rank it among matching results, we check for matches just once and return an EmojiMatchQuality enum rather than just a boolean, which records the information we need for ranking too.

I think the duplication of logic within the web implementation, in each of those areas, is the root cause of the bugs listed above — they're discrepancies between one part of web's logic and another, as tends to naturally arise once the same logic is repeated in multiple places in different ways.

Selected commit messages

algorithms: Add bucketSort, for stable, linear-time sorting

Specifically the sort takes linear time so long as there aren't
more than linearly-many buckets. For our immediate use case of
ranking emoji-autocomplete results, we'll in fact stick to a
constant number of buckets.


emoji [nfc]: Add ranking framework for emoji autocomplete results


emoji test [nfc]: Make the EmojiMatchQuality values explicit

This actually has a pretty nice effect on the readability of these
tests, even at this stage where the enum isn't doing anything!

Separate from the parent commit just because this one is a
bigger, and almost entirely mechanical and boring, diff.


emoji: Rank by quality of match (exact, prefix, other)

Fixes part of #1068.


emoji: Add list of the "popular" emoji


emoji: Rank "popular" > custom > other emoji

Fixes part of #1068.


emoji: Recognize word-aligned matches in ranking

Fixes #1068.


emoji: Order "popular" emoji canonically amongst themselves

As a bonus, this provides the popular emoji as candidates
even when we haven't yet fetched the server's emoji data.

@gnprice
Copy link
Member Author

gnprice commented Dec 9, 2024

(There may also be differences in the base ordering within custom emoji or within Unicode emoji, seen when they match a given query equally well; I haven't yet studied those to pin them down.)

OK, did that too. The result was discovering #1113, fixed in #1114; and some other differences. Quoting a comment added in #1114:

    // Behavior differences we might copy or change, TODO:
    //  * Web has a particular ordering of Unicode emoji;
    //    a data file groups them by category and orders within each of those,
    //    and the code has a list of categories.
    //    This seems useful; it'll call for expanding the server emoji data API.
    //  * Both here and in web, the realm emoji appear in whatever order the
    //    server returned them in; and that order appears to be random,
    //    presumably the iteration order of some Python dict,
    //    and to vary over time.
    //
    // Behavior differences that web should probably fix, TODO(web):
    //  * Web ranks the realm's custom emoji (and the Zulip extra emoji) at the
    //    end of the base list, as seen in the emoji picker on an empty query;
    //    but then ranks them first, after only the six "popular" emoji,
    //    once there's a non-empty query.
    //  * Web gives the six "popular" emoji a set order amongst themselves,
    //    like we do after #1112; but in web, this order appears only in the
    //    emoji picker on an empty query, and is otherwise lost even when the
    //    emoji are taken out of their home categories and shown instead
    //    together at the front.
    //
    //    In web on an empty query, :+1: aka :like: comes first, and
    //    :heart: aka :love: comes later (fourth); but then on the query "l",
    //    the results begin with :love: and then :like:.  They've flipped order,
    //    even though they're equally good prefix matches to the query.

We'll soon (for zulip#1068) be adding logic that distinguishes these emoji
from other Unicode emoji.  That would break some test cases which
refer to an emoji that happens to be "popular", like 😄, when
they really just intend the generic behavior that happens to any
Unicode emoji.
These searches for 'h' and 's' would have many results in a real
emoji database; in particular they'd have multiple results even
in an emoji database that has just the six "popular" emoji.

We'll soon make a change that causes those "popular" emoji to
always be present.  To prepare these tests, make the queries
more specific.
This is NFC except for performance.

This is an extremely common case -- it happens every time the
user opens the emoji picker, and applies to every emoji -- so
worth optimizing.
This is NFC except for performance: it saves us from having to
construct this string again for each emoji name.

At this stage there's a (much smaller) performance loss, too:
we now compute this string even when we'll never actually use it
because the query doesn't contain the separator.  But we'll soon
start checking for word-aligned matches even for those queries
where we'd accept a non-word-aligned match; once we do, we'll
need this string for all nonempty queries, for almost all names.
Specifically the sort takes linear time so long as there aren't
more than linearly-many buckets.  For our immediate use case of
ranking emoji-autocomplete results, we'll in fact stick to a
constant number of buckets.
This logic will grow to handle ranking, and it's closely tied to
the matching logic that lives on the query.
…y private

In the app, the only entry point to this logic should be through
EmojiAutocompleteView.  The method is public only because that
made it more convenient to write thorough unit tests.
We'll need this as we add more logic here.
This makes these tests a bit easier to read already, with less
noisy repetition; and it prepares them for adding more complexity
to how this matching works.
This actually has a pretty nice effect on the readability of these
tests, even at this stage where the enum isn't doing anything!

Separate from the parent commit just because this one is a
bigger, and almost entirely mechanical and boring, diff.
As a bonus, this provides the popular emoji as candidates
even when we haven't yet fetched the server's emoji data.
@gnprice
Copy link
Member Author

gnprice commented Dec 9, 2024

Rebased following the merge of #1114. Resolved conflicts, fixed some of that PR's tests affected by this one, and updated some comments.

@chrisbobbe chrisbobbe self-assigned this Dec 9, 2024
@chrisbobbe
Copy link
Collaborator

Thanks! LGTM, please merge at will.

I notice we don't give any results on CZO for these queries:

yoyo
yo-yo
thankyou
thank-you

even though the emojis "yo_yo" and "thank_you" exist. Maybe not ideal 🤷‍♂️ but not a priority since this is web's behavior too.

@gnprice gnprice merged commit 7045afe into zulip:main Dec 9, 2024
1 check passed
@gnprice
Copy link
Member Author

gnprice commented Dec 9, 2024

Thanks for the review! Merged.

Yeah, it might be reasonable to have a query like "thankyou" match the name thank_you. I think it's pretty natural to type the space, though: "thank you". And I feel like I usually expect a search UI to require separate words in the query in order to match separate words in the text.

(The "yoyo" example has perhaps a confounding factor that the actual word is often spelled like that, as well as "yo-yo". But the solution to that would be to have an emoji named yoyo.)

In any case, that'd be a change to make across both web and mobile.

@gnprice gnprice deleted the pr-emoji-ranking branch December 9, 2024 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintainer review PR ready for review by Zulip maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rank emoji autocomplete results
2 participants