Describe the bug
I'm curious whether it is intended (and if so, why) that the text returned by highlighting encodes 4-byte characters of the Supplementary Multilingual Plane (SMP) as two 3-byte sequences. This is not valid UTF-8; it appears to be the CESU-8 encoding, which is "... not intended nor recommended as an encoding used for open information exchange".
For example, if the character "𐑏" (U+1044F) is contained in the indexed text, it is returned as-is in the corresponding field, but it is given as "\uD801\uDC4F" in the highlighting output.
Note that this is not visible in OpenSearch Dashboards, because there the byte sequence is processed at some point and the expected character is displayed. We observed it when calling the REST API from a native application. It can also be observed when querying with curl or when inspecting the raw output in Postman.
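To make the encodings discussed above concrete, here is a minimal Python sketch that derives the surrogate pair for "𐑏" and compares its UTF-8 bytes with the CESU-8 form. The helper names `surrogate_pair` and `cesu8` are made up for illustration and are not part of OpenSearch; whether the highlight output contains raw CESU-8 bytes or the JSON escape sequence has to be checked against the actual raw response.

```python
def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary-plane code point into UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def cesu8(text: str) -> bytes:
    """Encode text as CESU-8: each surrogate half becomes its own 3-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            for half in surrogate_pair(cp):
                # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate
                out += chr(half).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U0001044F"  # the character from the report
high, low = surrogate_pair(ord(ch))
print(f"\\u{high:04X}\\u{low:04X}")  # \uD801\uDC4F
print(ch.encode("utf-8").hex(" "))    # f0 90 91 8f        (valid UTF-8, 4 bytes)
print(cesu8(ch).hex(" "))             # ed a0 81 ed b1 8f  (CESU-8, 2 x 3 bytes)
```

A strict UTF-8 decoder rejects the six-byte form, which is why a native client can fail where a browser-based UI does not.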
Related component
Search
To Reproduce
1. Insert an entry in an index where a text field contains a 4-byte character of the SMP.
2. Query this entry with the highlight parameter via curl such that the character from step 1 is part of the returned text.
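The two steps above could be scripted roughly as follows. This is a sketch under assumptions that are not in the report: an unsecured node at `http://localhost:9200`, a hypothetical index name `smp-highlight-test`, and a field called `text`. A cluster with the security plugin enabled would additionally need HTTPS and credentials.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node without TLS or auth
INDEX = "smp-highlight-test"        # hypothetical index name
TEXT = "deseret \U0001044F letter"  # contains the 4-byte character U+1044F


def build_requests(base_url=BASE_URL, index=INDEX):
    """Build the indexing request (step 1) and the highlight query (step 2)."""

    def req(method, path, body):
        # ensure_ascii=False sends the character as raw UTF-8, not as \u escapes
        data = json.dumps(body, ensure_ascii=False).encode("utf-8")
        return urllib.request.Request(
            f"{base_url}/{index}/{path}",
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )

    index_req = req("PUT", "_doc/1?refresh=true", {"text": TEXT})
    search_req = req("POST", "_search", {
        "query": {"match": {"text": "deseret"}},
        "highlight": {"fields": {"text": {}}},
    })
    return index_req, search_req


def run():
    """Send both requests and return the raw bytes of the search response."""
    raw = b""
    for request in build_requests():
        with urllib.request.urlopen(request) as resp:
            raw = resp.read()
    return raw
```

Calling `run()` against a test cluster returns the undecoded response body, so the `_source` and `highlight` sections can be compared byte for byte.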
Expected behavior
I would have expected the whole response to be valid UTF-8, especially since JSON is also defined to be encoded in UTF-8.
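One way to check this expectation against a raw response body is sketched below. The function `inspect_response` and both patterns are illustrative assumptions, not an OpenSearch API. The check separates three cases: the body is strictly valid UTF-8, it contains raw CESU-8 surrogate byte sequences, or it contains a JSON `\uXXXX\uXXXX` escape for a surrogate pair.

```python
import json
import re

# Raw CESU-8 pair: ED A0..AF xx (high surrogate) followed by ED B0..BF xx (low surrogate)
CESU8_PAIR = re.compile(rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]")
# JSON escapes forming a surrogate pair, e.g. \uD801\uDC4F
# (simplification: does not rule out a preceding escaped backslash)
ESCAPED_PAIR = re.compile(
    rb"\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}"
)


def inspect_response(raw: bytes) -> dict:
    """Classify how supplementary-plane characters appear in a raw response body."""
    try:
        raw.decode("utf-8")  # Python's strict decoder rejects encoded surrogates
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {
        "valid_utf8": valid,
        "cesu8_pairs": len(CESU8_PAIR.findall(raw)),
        "escaped_pairs": len(ESCAPED_PAIR.findall(raw)),
    }


escaped = b'{"highlight": "\\uD801\\uDC4F"}'
print(inspect_response(escaped))
print(inspect_response(b"\xed\xa0\x81\xed\xb1\x8f"))
print(json.loads(escaped)["highlight"] == "\U0001044F")  # True
```

Python's `json.loads` combines an escaped surrogate pair back into the single character, which shows how a tolerant client can hide the difference.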
Additional Details
Plugins
Please list all plugins currently enabled.
Screenshots
If applicable, add screenshots to help explain your problem.
Host/Environment (please complete the following information):
OS: RHEL9
Version: OpenSearch 2.11.0
Additional context
Add any other context about the problem here.
Hi @peternied,
I would like to contribute, but unfortunately I am not experienced enough in Java, and so far I have only used OpenSearch as a user. It would be great if a maintainer could take a look at this issue.