Mention start and length are measured in terms of UTF-16 code units, not code points or bytes #1504

Closed
brenns10 opened this issue Apr 12, 2024 · 2 comments


@brenns10
Contributor

Hello! I'm not sure whether to call this a bug, missing documentation, or user error, but I wanted to at least share an issue I've observed relating to mention indices and string encodings.

Description of the issue

For incoming messages, mentions are provided as an array of objects with "start" and "length" fields, representing the substring to replace with the mention. However, the units of start and length are unspecified. I had personally expected one of two options: (a) Unicode code points, or (b) bytes in the UTF-8 encoded string, with (a) seeming far more likely.

However, the answer is neither (a) nor (b)... instead, the unit seems to be (c) UTF-16 code units. For Unicode code points beyond the Basic Multilingual Plane (BMP), i.e. code points >= 0x10000, the UTF-16 representation uses a "surrogate pair" of two UTF-16 code units. Emoji are a common example of characters beyond the BMP. To illustrate this, imagine the message: hello 💩 @user. Here's a table showing each character in the string that signal-cli would return for this message, along with its indices for (a), (b), and (c).

char   code point  (a) code point index  (b) UTF-8 byte index  (c) UTF-16 code unit index
h      U+0068      0                     0                     0
e      U+0065      1                     1                     1
l      U+006C      2                     2                     2
l      U+006C      3                     3                     3
o      U+006F      4                     4                     4
SPC    U+0020      5                     5                     5
💩     U+1F4A9     6                     6-9                   6-7
SPC    U+0020      7                     10                    8
@user  U+FFFC      8                     11-13                 9

The first column is the character (or some representation of it). The second column is its Unicode code point. The remaining columns give the index of that character when counting by (a) Unicode code points, (b) UTF-8 encoded bytes, and (c) UTF-16 code units, respectively.

The @user mention seems to be commonly represented with U+FFFC ("Object replacement character"). That code point is within the BMP, so it is represented by one UTF-16 code unit (but three bytes in UTF-8). The 💩 emoji is U+1F4A9 which is beyond the BMP, so it is represented by a surrogate pair of UTF-16 code units, and four UTF-8 bytes.
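The three index columns for the mention placeholder can be reproduced with a short Python sketch. The function name `index_in_each_unit` is made up for illustration, and the message string is built by hand from the table above rather than taken from real signal-cli output.

```python
# The example message: "hello", space, U+1F4A9, space, U+FFFC (mention placeholder).
message = "hello \U0001F4A9 \uFFFC"


def index_in_each_unit(text: str, target: str) -> dict:
    """Return the offset of `target` in `text` under the three counting schemes."""
    cp_index = text.index(target)  # Python str is indexed by code point
    prefix = text[:cp_index]
    return {
        "code_points": cp_index,
        "utf8_bytes": len(prefix.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; "utf-16-le" does not prepend a BOM.
        "utf16_units": len(prefix.encode("utf-16-le")) // 2,
    }


offsets = index_in_each_unit(message, "\uFFFC")
print(offsets)
```

Only the `utf16_units` value agrees with the "start" that signal-cli reports for this message.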

Here's the array of mentions which signal-cli's jsonRpc gives for that string (replaced the identifiers for privacy):

[
  {
    "name": "+1XXXXXXXXXX",
    "number": "+1XXXXXXXXXX",
    "uuid": "00000000-0000-0000-0000-000000000000",
    "start": 9,
    "length": 1
  }
]

As you can see, signal-cli says that the starting index for the mention is 9 -- which can only be correct if we're counting by (c) UTF-16 code units!

The reason I find this behavior unexpected and confusing is that the actual emoji appears in the JSON UTF-8 encoded, so an application has no reason to expect that these string indices should be treated as UTF-16 code units. My guess is that Java internally uses UTF-16 to represent strings, so the indices match the way you would index them in Java. They are represented this way in signald as well, so these numbers are probably coming directly from the Signal protocol/library rather than being computed by signal-cli itself.

Impact

In Python, strings are indexed by code points, so correctly written client software that uses the provided indices as-is would fail. For example, the above message might result in an IndexError, since the string has only 9 code points (indices 0 through 8) and so there is no code point at index 9. I'd imagine this happens with other languages that represent strings as a sequence of Unicode code points.
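One way a Python client might handle incoming mentions is to translate the UTF-16 offsets into code point indices before slicing. This is a minimal sketch: `utf16_to_codepoint_index` is a hypothetical helper, not part of signal-cli or any library, and the `mention` dict keeps only the two fields from the example above.

```python
def utf16_to_codepoint_index(text: str, utf16_index: int) -> int:
    """Map an offset counted in UTF-16 code units to a code point index.

    An offset that falls inside a surrogate pair is rounded up to the
    next code point; an offset past the end maps to len(text).
    """
    units = 0
    for i, ch in enumerate(text):
        if units >= utf16_index:
            return i
        # Code points above U+FFFF need two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


message = "hello \U0001F4A9 \uFFFC"
mention = {"start": 9, "length": 1}

cp_start = utf16_to_codepoint_index(message, mention["start"])
cp_end = utf16_to_codepoint_index(message, mention["start"] + mention["length"])
placeholder = message[cp_start:cp_end]
print(cp_start, cp_end, repr(placeholder))
```

Converting the start and the end separately, rather than reusing "length" directly, keeps the slice correct even if the replaced span itself contains characters outside the BMP.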

This can also cause issues when sending messages that have an emoji followed by a mention... if you assumed that the indices were based on unicode code points, you'll find that the message that gets delivered will have replaced the wrong text with your mention, which will make everything look odd.
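For the sending direction, the conversion runs the other way: a span found with ordinary Python string operations has to be expressed in UTF-16 code units. The sketch below only computes the two numbers; the helper names are hypothetical, and how the result is passed to signal-cli is deliberately left out.

```python
def utf16_len(text: str) -> int:
    """Number of UTF-16 code units needed to encode `text`."""
    # "utf-16-le" emits no BOM, so every 2 bytes is exactly one code unit.
    return len(text.encode("utf-16-le")) // 2


def codepoint_span_to_utf16_span(text: str, cp_start: int, cp_length: int) -> tuple[int, int]:
    """Convert a (start, length) span in code points to UTF-16 code units."""
    start = utf16_len(text[:cp_start])
    length = utf16_len(text[cp_start:cp_start + cp_length])
    return start, length


body = "hello \U0001F4A9 @user"
cp_start = body.index("@user")  # 8 when counting code points
span = codepoint_span_to_utf16_span(body, cp_start, len("@user"))
print(span)
```

Using the code point index 8 directly as "start" would be off by one here, because the emoji before the mention occupies two UTF-16 code units.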

My use case happens to be in C (I know, I know 😛) so everything is encoded bytes, but as long as I know how the indices should be interpreted, I can handle it myself just fine.
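A byte-oriented client can do the same translation directly on the UTF-8 data by looking at lead bytes, since only four-byte UTF-8 sequences correspond to surrogate pairs in UTF-16. The sketch below is written in Python for consistency with the other examples, but the loop carries over to C almost line for line. `utf16_to_utf8_offset` is a hypothetical name, and the code assumes the input is valid UTF-8.

```python
def utf16_to_utf8_offset(data: bytes, utf16_index: int) -> int:
    """Map an offset in UTF-16 code units to a byte offset in UTF-8 `data`.

    Assumes `data` is valid UTF-8. An offset inside a surrogate pair is
    rounded up to the end of that character.
    """
    units = 0
    pos = 0
    while pos < len(data) and units < utf16_index:
        lead = data[pos]
        if lead < 0x80:
            size = 1  # ASCII
        elif lead >> 5 == 0b110:
            size = 2
        elif lead >> 4 == 0b1110:
            size = 3
        else:
            size = 4  # code point >= U+10000
        # A 4-byte sequence is a surrogate pair in UTF-16; all others are one unit.
        units += 2 if size == 4 else 1
        pos += size
    return pos


data = "hello \U0001F4A9 \uFFFC".encode("utf-8")
byte_start = utf16_to_utf8_offset(data, 9)      # mention "start"
byte_end = utf16_to_utf8_offset(data, 9 + 1)    # "start" + "length"
print(byte_start, byte_end, data[byte_start:byte_end])
```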

But in general, I don't know what makes the most sense... should signal-cli internally convert "start" and "length" to unicode code points? (How to do that without breaking compatibility with clients using the current representation??) Or should this just be documented somewhere and left alone? (probably the right answer).

signal-cli version info

Linux x86_64, using the regular build from the GitHub releases page:

$ bin/signal-cli --version
signal-cli 0.13.2
@AsamK
Owner

AsamK commented Apr 12, 2024

Yes, those are measured in UTF-16 code units, because that's what the Signal protocol uses, which got the behavior from Android/Java.
Text style ranges also behave the same way.

As there's no obvious way to index unicode chars (code points, grapheme clusters, UTF-8 bytes, ...) I'll keep it this way.
But you're right the documentation should mention this.

brenns10 added a commit to brenns10/signal-cli that referenced this issue Apr 13, 2024
The unit of UTF-16 code units is not necessarily obvious for users of
languages that index strings by Unicode code points. Provide a pointer
to an FAQ entry as well:

https://github.com/AsamK/signal-cli/wiki/FAQ#string-indexing-units

Closes AsamK#1504

Signed-off-by: Stephen Brennan <[email protected]>
@brenns10
Contributor Author

Thanks for confirming I'm not crazy! And I do agree after writing it all out, keeping it unchanged makes sense. It's not difficult to get it right as a client so long as you know the behavior.

I put down something in the FAQ to document this and submitted #1505 to update send --mention and send --text-style documentation to mention it. Hope this helps.

@AsamK AsamK closed this as completed in e5ebb73 Apr 13, 2024