fix(android): some unicode characters not working #65

Closed
wants to merge 1 commit into from


SlackCrow

The original code used NewStringUTF, which crashed the app when certain Unicode characters, such as emojis, were used.

So instead, I added a helper function that manually converts the UTF-8 input into a jstring.
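
For context, JNI's NewStringUTF expects Modified UTF-8, in which characters outside the Basic Multilingual Plane (most emojis) are encoded as surrogate pairs rather than as standard 4-byte sequences. The following is a minimal sketch of the kind of conversion such a helper might perform, not the PR's actual code: the function name `utf8_to_utf16` is hypothetical, and the sketch replaces malformed input with U+FFFD but does not reject overlong encodings.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not the PR's code): decode standard UTF-8 into UTF-16
// code units. Malformed sequences become U+FFFD; overlong forms are not rejected.
std::u16string utf8_to_utf16(const std::string &in) {
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        uint32_t cp = 0;
        size_t extra = 0;  // number of continuation bytes expected
        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else { out.push_back(kReplacement); ++i; continue; }  // stray or invalid byte

        // All continuation bytes must be present and look like 10xxxxxx.
        bool ok = (i + extra < in.size());
        for (size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (cc & 0x3F);
        }
        if (!ok) { out.push_back(kReplacement); ++i; continue; }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.push_back(kReplacement);  // outside Unicode or a lone surrogate
        } else if (cp >= 0x10000) {
            cp -= 0x10000;  // split into a UTF-16 surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

On the JNI side, the resulting code units could then be handed to `NewString`, which takes UTF-16 data and a length instead of a Modified UTF-8 C string.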

@jhen0409 jhen0409 changed the title fix: some unicode characters not working on Android fix(android): some unicode characters not working Jul 15, 2024
@jhen0409
Member

Hi, thank you for the PR!

I can see the correct completion result, but the partial token event gives an unexpected result. We already handle this by waiting for the remaining bytes of 2-4 byte characters before sending the token event:

llama.rn/cpp/rn-llama.hpp

Lines 464 to 482 in 4c90a78

const char c = token_text[0];
// 2-byte characters: 110xxxxx 10xxxxxx
if ((c & 0xE0) == 0xC0)
{
    multibyte_pending = 1;
}
// 3-byte characters: 1110xxxx 10xxxxxx 10xxxxxx
else if ((c & 0xF0) == 0xE0)
{
    multibyte_pending = 2;
}
// 4-byte characters: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
else if ((c & 0xF8) == 0xF0)
{
    multibyte_pending = 3;
}
else
{
    multibyte_pending = 0;

It would be great if you could look into this problem. I'll do so when I'm available.
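
To illustrate the idea of waiting for the remaining bytes before emitting a token event, here is a minimal, self-contained sketch. It is not the code in rn-llama.hpp: the names `incomplete_utf8_suffix` and `Utf8Accumulator` are hypothetical, and the sketch assumes token pieces arrive as raw byte strings that may split a multibyte character.

```cpp
#include <string>

// Number of trailing bytes in `s` that start a UTF-8 sequence whose
// continuation bytes have not all arrived yet (0 if `s` ends cleanly).
size_t incomplete_utf8_suffix(const std::string &s) {
    const size_t n = s.size();
    // A sequence is at most 4 bytes long, so look back at most 4 bytes.
    for (size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) continue;       // continuation byte, keep looking
        size_t need = 1;                        // ASCII or invalid lead byte
        if ((c & 0xE0) == 0xC0) need = 2;       // 110xxxxx
        else if ((c & 0xF0) == 0xE0) need = 3;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) need = 4;  // 11110xxx
        return need > back ? back : 0;
    }
    return 0;  // no lead byte found: malformed input, hold nothing back
}

// Hypothetical accumulator: buffers token pieces and releases only text
// that ends on a complete UTF-8 character.
class Utf8Accumulator {
public:
    std::string push(const std::string &piece) {
        pending_ += piece;
        const size_t hold = incomplete_utf8_suffix(pending_);
        std::string ready = pending_.substr(0, pending_.size() - hold);
        pending_.erase(0, ready.size());  // keep only the incomplete tail
        return ready;
    }

private:
    std::string pending_;
};
```

With this approach, feeding the four bytes of an emoji across several `push` calls returns an empty string until the final byte arrives, so a partial event would never carry a truncated character.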

@jhen0409 jhen0409 added the bug Something isn't working label Jul 15, 2024
@SlackCrow
Author

SlackCrow commented Jul 18, 2024

Could you elaborate on what you meant by partial token events? It might be because the function I wrote has very little error handling for invalid UTF-8 (in fact, it only checks whether the input is null). I didn't include more because, as far as I know, llama.cpp treats all token values as Unicode, but I could be wrong.

@jhen0409
Member

Could you elaborate what you meant by partial token events?

This:

if (
    stop_pos == std::string::npos ||
    // Send rest of the text if we are at the end of the generation
    (!llama->has_next_token && !is_stop_full && stop_pos > 0)
) {
    const std::string to_send = llama->generated_text.substr(pos, std::string::npos);
    sent_count += to_send.size();
    std::vector<rnllama::completion_token_output> probs_output = {};
    auto tokenResult = createWriteableMap(env);
    putString(env, tokenResult, "token", to_send.c_str());
    if (llama->params.sparams.n_probs > 0) {
        const std::vector<llama_token> to_send_toks = llama_tokenize(llama->ctx, to_send, false);
        size_t probs_pos = std::min(sent_token_probs_index, llama->generated_token_probs.size());
        size_t probs_stop_pos = std::min(sent_token_probs_index + to_send_toks.size(), llama->generated_token_probs.size());
        if (probs_pos < probs_stop_pos) {
            probs_output = std::vector<rnllama::completion_token_output>(llama->generated_token_probs.begin() + probs_pos, llama->generated_token_probs.begin() + probs_stop_pos);
        }
        sent_token_probs_index = probs_stop_pos;
        putArray(env, tokenResult, "completion_probabilities", tokenProbsToMap(env, llama, probs_output));
    }
    jclass cb_class = env->GetObjectClass(partial_completion_callback);
    jmethodID onPartialCompletion = env->GetMethodID(cb_class, "onPartialCompletion", "(Lcom/facebook/react/bridge/WritableMap;)V");
    env->CallVoidMethod(partial_completion_callback, onPartialCompletion, tokenResult);
}

Our example also uses it.

@jhen0409
Member

3799cf8 should fix both the Unicode issue and the partial event issue, so this handling may no longer be needed. Still, thank you for the PR!

@jhen0409 jhen0409 closed this Jul 27, 2024
@SlackCrow
Author

No problem, thank you for fixing the code! I was a bit busy the last few days and couldn't really check it.
