[BUG] Highlighting returns invalid UTF-8 string containing CESU-8 surrogate byte sequences #12132

sapi33 · 2024-02-01T16:03:49Z

Describe the bug

I'm curious whether it is intended (and if so, why) that the text returned by highlighting encodes 4-byte characters from the Supplementary Multilingual Plane (SMP) as two 3-byte sequences. This is not valid UTF-8; it appears to be the CESU-8 encoding, which is "... not intended nor recommended as an encoding used for open information exchange".

E.g. if the character "𐑏" (U+1044F) is contained in the indexed text, it is returned as-is in the corresponding field but encoded as "\uD801\uDC4F" in the highlighting output.
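
To make the difference concrete, here is a minimal Python sketch comparing the two encodings for this character. The helper names `cesu8_bytes` and `is_valid_utf8` are made up for illustration and are not part of OpenSearch; the sketch only shows what the byte sequences look like, not where in OpenSearch they originate.

```python
def cesu8_bytes(text: str) -> bytes:
    """Encode text as CESU-8: every UTF-16 code unit, including each half
    of a surrogate pair, is written as its own UTF-8-style sequence."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" allows lone surrogates to be written as 3 bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ch = "\U0001044F"  # the character from the example above

print(ch.encode("utf-8").hex(" "))     # f0 90 91 8f        (one 4-byte sequence)
print(cesu8_bytes(ch).hex(" "))        # ed a0 81 ed b1 8f  (two 3-byte sequences)
print(is_valid_utf8(cesu8_bytes(ch)))  # False
```

A strict UTF-8 decoder rejects the second byte sequence because it encodes the surrogate code points U+D801 and U+DC4F, which UTF-8 does not permit.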

Note that this is not visible in OpenSearch Dashboards, as the byte sequence is processed at some point there and the expected character is displayed. We observed it when calling the REST API from a native application. It can also be observed when querying with e.g. curl, or when inspecting the raw output in Postman.
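
For clients that have to cope with such a response in the meantime, one possible client-side repair is sketched below in Python. This is an assumption about how a consumer could work around the problem, not a description of what OpenSearch Dashboards does internally; `repair_surrogates` is a hypothetical helper name.

```python
def repair_surrogates(raw: bytes) -> str:
    """Decode bytes that may contain CESU-8 style surrogate sequences
    and merge each surrogate pair into the character it stands for."""
    # Step 1: decode leniently so surrogate code points are kept
    # instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16, where a high/low surrogate pair
    # is the normal representation of a supplementary character.
    # An unpaired surrogate still raises UnicodeDecodeError here.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


broken = bytes.fromhex("eda081edb18f")  # two 3-byte surrogate sequences
print(repair_surrogates(broken) == "\U0001044F")  # True
```

Input that is already valid UTF-8 passes through unchanged, so the function can be applied to the whole response body.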

Related component

Search

To Reproduce

1. Insert an entry in an index where a text field contains a 4-byte character of the SMP.
2. Query this entry with the highlight parameter via curl such that the character from step 1 is part of the returned text.
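
The two steps above can be sketched as a small Python script. Everything specific here is an assumption for illustration: a node reachable at `http://localhost:9200` without TLS or authentication, an index named `cesu8-repro`, a field named `text`, and the sample text itself. Adjust these for a real cluster (for example, add credentials if the security plugin is enabled).

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, no auth/TLS
INDEX = "cesu8-repro"               # hypothetical index name
TEXT = "hello \U0001044F world"     # sample text containing U+1044F


def build_requests():
    """Build the indexing request (step 1) and the highlight query (step 2)."""
    headers = {"Content-Type": "application/json"}
    index_doc = urllib.request.Request(
        f"{BASE_URL}/{INDEX}/_doc/1?refresh=true",
        # ensure_ascii=False sends the character as real UTF-8 bytes.
        data=json.dumps({"text": TEXT}, ensure_ascii=False).encode("utf-8"),
        headers=headers,
        method="PUT",
    )
    search = urllib.request.Request(
        f"{BASE_URL}/{INDEX}/_search",
        data=json.dumps({
            "query": {"match": {"text": "hello"}},
            "highlight": {"fields": {"text": {}}},
        }).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    return index_doc, search


def run():
    """Send both requests and print the raw response bytes of each."""
    for req in build_requests():
        with urllib.request.urlopen(req) as resp:
            print(resp.read())
```

Calling `run()` against a live node prints the raw bytes, so the `highlight` section of the search response can be compared with `_source` without any client-side decoding in between.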

Expected behavior

I would have expected the whole response to be valid UTF-8, especially since JSON is also defined for UTF-8.

Additional Details

Host/Environment (please complete the following information):

OS: RHEL9
Version: OpenSearch 2.11.0

peternied · 2024-02-14T16:54:33Z

[Triage - attendees 1 2 3 4 5 6 7 8]
@sapi33 Thanks for filing, we'd welcome a pull request to address this issue.

sapi33 · 2024-02-20T07:38:59Z

Hi @peternied,
I would like to contribute, but unfortunately I am not experienced enough in Java and have so far only been a user of OpenSearch. It would be great if a maintainer could take a look at this issue.

sapi33 added the bug and untriaged labels Feb 1, 2024

github-actions bot added the Search label Feb 1, 2024

github-project-automation bot added this to Search Project Board Feb 1, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Feb 1, 2024

peternied removed the untriaged label Feb 14, 2024

getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024
