Unicode errors (e.g., f��r instead of für) when using "fast" citation quality in RAG flow #572

Open
ulan-yisaev opened this issue Sep 11, 2024 · 3 comments

Comments

@ulan-yisaev

SDK Version (required)
5.9.1

Describe the bug
When using the "fast" option for citation_quality in the RAG flow with the Cohere API, Unicode errors occur in the streaming response, but only for characters that are split across multiple tokens. Multi-byte characters such as ü, ö, and ä are then displayed as the Unicode replacement character (�, U+FFFD), even though the same characters are displayed correctly elsewhere in the response. The issue disappears when citation_quality is set to "accurate".

Examples:

Incorrect encoding in citation_quality: "fast"

  1. Lösungen und beschäftigen:

    • Expected text: "Lösungen und beschäftigen"
    • Received stream:
      {"message": "L\ufffd"}
      {"message": "\ufffdsung"}
      {"message": "en u"}
      {"message": "nd besch\ufffd"}
      {"message": "\ufffdfti"}
      {"message": "gen m"}
  2. Körpertemperatur:

    • Expected text: "Körpertemperatur"
    • Received stream:
      {"message": "e K\ufffd"}
      {"message": "\ufffdrpe"}
      {"message": "rtemp"}
      {"message": "erat"}
      {"message": "ur"}
  3. Frühsommer:

    • Expected text: "Frühsommer"
    • Received stream:
      {"message": "Fr"}
      {"message": "\ufffd"}
      {"message": "\ufffdh"}
      {"message": "so"}
      {"message": "mme"}
      {"message": "r"}

As shown, characters like ö, ü, and ä are split between messages, causing them to be replaced with \ufffd, which represents an invalid character or decoding error.
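The pattern in the streams above matches what happens when UTF-8 bytes are decoded one chunk at a time. The following is a minimal sketch of that mechanism, not a confirmed diagnosis: the issue does not say where in the pipeline the decoding happens, and the chunk boundaries and the helper names `decode_each_chunk` and `decode_incrementally` are assumptions chosen to mirror the "Frühsommer" example. It uses only the Python standard library.

```python
import codecs

# Hypothetical byte chunks: the 2-byte UTF-8 encoding of "ü" (0xC3 0xBC)
# straddles a chunk boundary, as in the "Frühsommer" stream above.
CHUNKS = [b"Fr", b"\xc3", b"\xbch", b"so", b"mme", b"r"]


def decode_each_chunk(chunks):
    """Decode every chunk on its own; a split character becomes U+FFFD twice."""
    return [chunk.decode("utf-8", errors="replace") for chunk in chunks]


def decode_incrementally(chunks):
    """Carry incomplete byte sequences over to the next chunk before decoding."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))  # flush any trailing bytes
    return pieces


if __name__ == "__main__":
    print(decode_each_chunk(CHUNKS))
    print("".join(decode_incrementally(CHUNKS)))
```

If the replacement characters are already present in the JSON sent by the server, as the streams above suggest, a client-side incremental decoder cannot recover the original text; the sketch only illustrates the kind of fix that would be needed wherever the bytes are first decoded.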

Expected Behavior
Special characters should be handled and displayed correctly, even when split across tokens in a streaming response, regardless of the citation_quality setting.

Actual Behavior
When citation_quality is set to "fast", characters that are split across tokens (especially multi-byte characters like ü, ö, ä) are incorrectly displayed as the Unicode replacement character (� or \ufffd).

Screenshots
N/A

Workaround
Setting citation_quality to "accurate" resolves the issue, but at the cost of performance.

@daniel-cohere
Contributor

Hi - I've failed to reproduce this issue - could you share a sample request that consistently reproduces the issue and I can investigate more? Thanks!

@frankpepermans

We're also experiencing this issue, but only after using the Cohere model in the Azure marketplace.

Previously we used the model directly from Cohere. I remember we did have Unicode errors as described at the top of this issue, but the latest version was fine (in "fast" mode).

It's also worth noting that the citation start and end ranges are offset once the above bug is encountered.
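One plausible reason for the shifted ranges, offered as an assumption rather than something the thread confirms: each damaged two-byte character shows up as two U+FFFD code points, so the received text is one character longer per damaged character than the text the citation indices were computed against. A small sketch using the "Lösungen und beschäftigen" example from the report (the variable names are illustrative only):

```python
# Text as intended, and as received when "ö" and "ä" each arrive as two U+FFFD.
expected = "Lösungen und beschäftigen"
damaged = expected.replace("ö", "\ufffd\ufffd").replace("ä", "\ufffd\ufffd")

# A citation span computed against the intended text.
start = expected.index("beschäftigen")
end = start + len("beschäftigen")

# The same indices applied to the damaged text no longer select the same word.
span_in_expected = expected[start:end]
span_in_damaged = damaged[start:end]
extra_chars = len(damaged) - len(expected)

if __name__ == "__main__":
    print(repr(span_in_expected))
    print(repr(span_in_damaged))
    print(extra_chars)
```

Under this assumption the offset would grow by one for every damaged character that precedes the citation.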

@mkozakov
Collaborator

@frankpepermans @ulan-yisaev We are still unable to reproduce the issue. Do you have a request we could try to reproduce it with?
