[BUG] Highlighting returns invalid UTF-8 string containing CESU-8 surrogate byte sequences #12132

Open
sapi33 opened this issue Feb 1, 2024 · 2 comments
Labels
bug Something isn't working Search Search query, autocomplete ...etc

Comments

@sapi33
sapi33 commented Feb 1, 2024

Describe the bug

I'm curious whether it is intended (and if so, why) that the text returned by highlighting encodes 4-byte characters of the Supplementary Multilingual Plane (SMP) as two 3-byte sequences. This is not valid UTF-8; it appears to be the CESU-8 encoding, which is "... not intended nor recommended as an encoding used for open information exchange".

For example, if the character "𐑏" (U+1044F) is contained in the text to be indexed, it is returned as-is in the corresponding field but appears as "\uD801\uDC4F" in the highlighting output.
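
To make the difference concrete, here is a minimal Python sketch comparing the two byte-level encodings of the code point that the surrogate pair \uD801\uDC4F decodes to (U+1044F). The helper names `utf8_bytes` and `cesu8_bytes` are made up for illustration and are not part of OpenSearch or any library. The sketch only shows what each encoding looks like on the wire; it does not claim which one the highlighter actually emits.

```python
CHAR = "\U0001044F"  # single code point, equivalent to the UTF-16 pair D801 DC4F


def utf8_bytes(text: str) -> bytes:
    """Standard UTF-8: one 4-byte sequence per supplementary-plane character."""
    return text.encode("utf-8")


def cesu8_bytes(text: str) -> bytes:
    """CESU-8: encode each UTF-16 code unit separately as if it were a code point.

    Surrogates therefore become two 3-byte sequences, which strict UTF-8 forbids.
    """
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the otherwise illegal surrogate bytes.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


print(utf8_bytes(CHAR).hex(" "))   # f0 90 91 8f
print(cesu8_bytes(CHAR).hex(" "))  # ed a0 81 ed b1 8f
```

For characters in the Basic Multilingual Plane both functions return identical bytes, so the difference only shows up for characters such as the one above.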

Note that this is not visible in OpenSearch Dashboards, because the byte sequence is processed at some point there and the expected character is displayed. We observed it when calling the REST API from a native application. It can also be observed by querying with curl or by inspecting the raw output in Postman.

Related component

Search

To Reproduce

  1. Insert an entry into an index where a text field contains a 4-byte character of the SMP
  2. Query this entry with the highlight parameter via curl such that the character from step 1 is part of the returned highlight text
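
The two steps above could be scripted roughly as follows. This is only a sketch: the base URL, the index name `cesu8-repro`, the field name `body`, and the helper names `build_requests` and `send` are assumptions chosen for illustration, and the sample text is invented. It uses the standard document index and `_search` endpoints with a `highlight` section, and returns the raw response bytes so they can be inspected before any client-side decoding.

```python
import json
import urllib.request

CHAR = "\U0001044F"  # 4-byte (in UTF-8) character from the SMP


def build_requests(base_url="http://localhost:9200", index="cesu8-repro", field="body"):
    """Return (method, url, payload) tuples for the two reproduction steps.

    All defaults are hypothetical; adjust them to your cluster.
    """
    doc = {field: f"sample text with {CHAR} inside"}
    search = {
        "query": {"match": {field: "sample"}},
        "highlight": {"fields": {field: {}}},
    }
    return [
        ("PUT", f"{base_url}/{index}/_doc/1?refresh=true", doc),
        ("POST", f"{base_url}/{index}/_search", search),
    ]


def send(method, url, payload):
    """Send one request and return the raw, undecoded response bytes."""
    # ensure_ascii=False keeps the character as real UTF-8 in the request body.
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage against a running cluster (not executed here):
#   for method, url, payload in build_requests():
#       raw = send(method, url, payload)
#   print(raw)  # raw bytes of the search response, including the highlight
```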

Expected behavior

I would expect the whole response to be valid UTF-8, especially since JSON is also defined for UTF-8.
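
One way to check this expectation on a raw response is sketched below in Python. The helper `is_valid_utf8` and the three sample payloads are hypothetical and hand-written; they are not captured OpenSearch output. The sketch distinguishes three cases that can look alike in a log: the character as real UTF-8, the character as a JSON `\uXXXX` escape pair (plain ASCII, so still valid UTF-8, and decoded back to one character by a JSON parser), and raw CESU-8 bytes (rejected by a strict UTF-8 decoder).

```python
import json


def is_valid_utf8(raw: bytes) -> bool:
    """Strict check: Python's UTF-8 decoder rejects encoded surrogates."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hand-written sample payloads (assumptions, not real server output).
raw_utf8 = '{"h": "\U0001044F"}'.encode("utf-8")   # character as 4 UTF-8 bytes
raw_escaped = b'{"h": "\\uD801\\uDC4F"}'            # JSON escape pair, pure ASCII
raw_cesu8 = b'{"h": "\xed\xa0\x81\xed\xb1\x8f"}'    # surrogates as raw bytes

print(is_valid_utf8(raw_utf8))     # True
print(is_valid_utf8(raw_escaped))  # True
print(is_valid_utf8(raw_cesu8))    # False

# A JSON parser combines the escaped pair into the single expected character.
print(json.loads(raw_escaped)["h"] == "\U0001044F")  # True
```

Running a check like this on the bytes returned by curl shows which of the three cases applies to a given response.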

Additional Details

Host/Environment (please complete the following information):

  • OS: RHEL9
  • Version: OpenSearch 2.11.0

@sapi33 sapi33 added bug Something isn't working untriaged labels Feb 1, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Feb 1, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6 7 8]
@sapi33 Thanks for filing, we'd welcome a pull request to address this issue.

@sapi33
Author

sapi33 commented Feb 20, 2024

Hi @peternied,
I would like to contribute, but unfortunately I am not experienced enough in Java, and so far I have only been on the user side of OpenSearch. It would be great if a maintainer could take a look at this issue.

@getsaurabh02 getsaurabh02 moved this from 🆕 New to Later (6 months plus) in Search Project Board Aug 15, 2024
Projects
Status: Later (6 months plus)
Development

No branches or pull requests

2 participants