Unicode errors (e.g., f��r instead of für) when using "fast" citation quality in RAG flow #572

ulan-yisaev · 2024-09-11T12:43:11Z

SDK Version (required)
5.9.1

Describe the bug
When using the "fast" option for citation_quality in the RAG flow with the Cohere API, Unicode errors occur, but only for certain characters when they are split across multiple tokens in a streaming response. This causes multi-byte characters like ü, ö, ä to be incorrectly displayed as the Unicode replacement character (�), even though the same characters are displayed correctly elsewhere in the response. This issue disappears when citation_quality is set to "accurate".

Examples:

Incorrect encoding in `citation_quality: "fast"`

Lösungen und beschäftigen:

Expected text: "Lösungen und beschäftigen"

Received stream:

{"message": "L\ufffd"}
{"message": "\ufffdsung"}
{"message": "en u"}
{"message": "nd besch\ufffd"}
{"message": "\ufffdfti"}
{"message": "gen m"}

Körpertemperatur:

Expected text: "Körpertemperatur"

Received stream:

{"message": "e K\ufffd"}
{"message": "\ufffdrpe"}
{"message": "rtemp"}
{"message": "erat"}
{"message": "ur"}

Frühsommer:

Expected text: "Frühsommer"

Received stream:

{"message": "Fr"}
{"message": "\ufffd"}
{"message": "\ufffdh"}
{"message": "so"}
{"message": "mme"}
{"message": "r"}

As shown, characters like ö, ü, and ä are split between messages, and each fragment is replaced with \ufffd (U+FFFD, the Unicode replacement character that a decoder emits for an invalid or incomplete byte sequence).
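
The pattern in the received streams is what UTF-8 decoding produces when a two-byte character is cut between chunks and each chunk is decoded on its own. Below is a minimal sketch of that effect using only the Python standard library. It assumes the chunks are decoded independently somewhere in the pipeline; where that actually happens is not established here, and the split points are chosen by hand to mirror the first example.

```python
import codecs

# UTF-8 bytes of the expected text, split at the same points as the first
# three chunks in the report ("ö" is the two bytes 0xC3 0xB6).
raw = "Lösungen u".encode("utf-8")
chunks = [raw[:2], raw[2:7], raw[7:]]  # b"L\xc3", b"\xb6sung", b"en u"

# Decoding each chunk separately: the dangling lead byte and the orphaned
# continuation byte are each replaced with U+FFFD.
per_chunk = [c.decode("utf-8", errors="replace") for c in chunks]

# An incremental decoder buffers the incomplete sequence until the next
# chunk arrives, so the character survives the split.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
incremental = [decoder.decode(c) for c in chunks]
incremental.append(decoder.decode(b"", final=True))

print(per_chunk)
print("".join(incremental))
```

Once a chunk has been decoded to \ufffd the original bytes are lost, so the text cannot be repaired from the received messages alone.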

Expected Behavior
Special characters should be handled and displayed correctly, even when split across tokens in a streaming response, regardless of the citation_quality setting.

Actual Behavior
When citation_quality is set to "fast", characters that are split across tokens (especially multi-byte characters like ü, ö, ä) are incorrectly displayed as the Unicode replacement character (� or \ufffd).

Screenshots
N/A

Workaround
Setting citation_quality to "accurate" resolves the issue, but at the cost of performance.

daniel-cohere · 2024-09-25T19:35:55Z

Hi - I've failed to reproduce this issue - could you share a sample request that consistently reproduces the issue and I can investigate more? Thanks!

frankpepermans · 2024-09-28T12:34:39Z

We're also experiencing this issue, but only after using the Cohere model in the Azure marketplace.

Previously we used the model directly from Cohere. I remember we did have Unicode errors like those described above, but the latest version was fine (in fast mode).

It's also worth noting that the citation start and end ranges are offset once the above bug is encountered.
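
One possible explanation for the offset, sketched below: each corrupted character arrives as two \ufffd characters instead of one, so the received text is one character longer per corruption. The sketch assumes citation offsets are computed against the intact text, which is not confirmed in this thread; the citation span is hypothetical, and the chunks are taken from the Frühsommer example above.

```python
expected = "Frühsommer"
# Chunks as reported in the "Frühsommer" stream.
received = "".join(["Fr", "\ufffd", "\ufffdh", "so", "mme", "r"])

# Hypothetical citation span for "sommer", computed on the intact text.
start = expected.index("sommer")
end = start + len("sommer")

# The corrupted text is longer, so the same span lands one character early.
extra = len(received) - len(expected)
print(expected[start:end])
print(received[start:end])
print(extra)
```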

mkozakov · 2024-11-22T02:56:54Z

@frankpepermans @ulan-yisaev We are still unable to reproduce the issue. Do you have a request we could try to reproduce it with?
