OpenAIGenerator uses chat_completions endpoint. Error with model that has no chat_template in config #8275

Open
Permafacture opened this issue Aug 23, 2024 · 6 comments
Labels
P2 Medium priority, add to the next sprint if no P1 available


@Permafacture

Describe the bug
I'm using the OpenAIGenerator to access a vLLM endpoint on RunPod. When using a base model like Mistral v0.3 that has not been instruction tuned, and so does not have a chat template in its tokenizer config, I get an error back from the API endpoint. Digging into this, I see that the OpenAIGenerator uses the chat/completions endpoint rather than the completions endpoint. This means I've been unintentionally applying a chat template with other models up to this point.
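
Roughly, the setup looks like this (the endpoint URL, key, and exact model name below are placeholders standing in for my RunPod deployment, not copied verbatim):

from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Point the OpenAI-compatible client at the vLLM server instead of api.openai.com.
generator = OpenAIGenerator(
    api_key=Secret.from_token("<api-key>"),       # placeholder
    api_base_url="https://<runpod-endpoint>/v1",  # placeholder
    model="mistralai/Mistral-7B-v0.3",            # base model, no chat_template in its tokenizer config
)

# Fails server-side because the prompt is sent through chat/completions,
# and the tokenizer has no chat_template to apply.
result = generator.run("And then, something unexpected happened.")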

Error message
"Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed!"

Expected behavior
I expected the completions API endpoint to be used, so the Hugging Face model would not try to call apply_chat_template().

Additional context
I tried to use the client.completions method directly as a workaround:

completion = generator.client.completions.create(
    model=generator.model,
    prompt="And then, something unexpected happened.",
    **generator.generation_kwargs,
)

The process on the server crashes with a 'NoneType' object has no attribute 'headers' error.

System:

  • Haystack version (commit or version number): haystack-ai==2.2.0
@lbux
Contributor

lbux commented Aug 24, 2024

Can you please provide some sample code to try and reproduce the error?

I understand why it is happening (the completions API is legacy and might stop being supported by OpenAI). There are ways to get the chat completions endpoint to mimic the completions one, and that is what Haystack tries to do, but I'd need an example to see whether the issue is with vLLM or with Haystack.

@Permafacture
Author

The header error is definitely on vLLM's side, or at least on the fork the RunPod folks are using. But I don't think it's right for the text completion class to use the chat completion endpoint. If the completions endpoint gets removed, in my opinion it's better to let the calls fail and inform the user than to quietly use a different endpoint. I was getting weird responses, and I would never have known why if I hadn't tried a model without a chat template.

For example, if you prompt with "it was a normal summer day until something unexpected happened", the chat endpoint will respond with "what happened?" rather than continue the story.

If you want to keep things as is so as not to break existing users' code, you could add a boolean kwarg like raw, just so the behavior is documented and users have the option of using the completions endpoint.

@lbux
Contributor

lbux commented Aug 24, 2024

I definitely see benefits and downsides to using the basic completions vs the chat completions for the regular generator.

Using OpenAIGenerator with a "prompt" and then converting it to a ChatMessage in the backend lets users quickly try the generators without having to worry about roles. It also lets them use the most recent models available from OpenAI (4o and 4o mini are not available in the regular completions API).
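
Roughly, the conversion amounts to something like this (a simplified sketch of the idea, not the exact Haystack internals):

from haystack.dataclasses import ChatMessage

def to_chat_messages(prompt: str, system_prompt: str | None = None) -> list[ChatMessage]:
    # The plain prompt string becomes a single user message; an optional system
    # prompt is prepended; the list is then sent to the chat/completions endpoint.
    messages = []
    if system_prompt:
        messages.append(ChatMessage.from_system(system_prompt))
    messages.append(ChatMessage.from_user(prompt))
    return messages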

In regard to completions... some models are definitely smart enough to finish what you started typing, and you can reinforce that by setting a system prompt that tells the model exactly how to complete the text.

And then, when it comes to setting the api_base_url and templates: since the chat completions endpoint is being used, some implementations of an OpenAI-API-compatible server may handle it differently. For example, this is how Ollama handles it:

By default, models imported into Ollama have a default template of {{ .Prompt }}, i.e. user inputs are sent verbatim to the LLM. This is appropriate for text or code completion models but lacks essential markers for chat or instruction models.

This means that Ollama can effectively mimic a completions call through the chat completions API even when a model has no template (as is the case for base models). vLLM does not seem to take this approach, and unless they provide a default "fallback" template, you will probably need to supply your own that does what Ollama implemented.
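
As a sketch of that workaround (the --chat-template flag and the template contents here are my assumptions to check against the vLLM docs, not something tested in this thread):

# passthrough.jinja: render each message's content verbatim, with no role markers,
# so a chat/completions request behaves like a plain completions request.
template = "{% for message in messages %}{{ message['content'] }}{% endfor %}"

with open("passthrough.jinja", "w") as f:
    f.write(template)

# The file would then be passed when starting the server, for example:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.3 --chat-template passthrough.jinja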

It may be possible to set a flag as you described and conditionally call the regular completions API (add it to generation_kwargs and extract it if present), but I don't believe it should be the default behavior, since most users are probably not using api_base_url.
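
As a rough illustration of that idea (hypothetical dispatch logic, not existing Haystack code):

def run_with_optional_raw(generator, prompt: str):
    # Hypothetical: pop the flag so it is not forwarded to the API itself.
    kwargs = dict(generator.generation_kwargs)
    use_completions = kwargs.pop("raw", False)

    if use_completions:
        # Legacy text completions: the prompt is sent verbatim, no chat template involved.
        return generator.client.completions.create(
            model=generator.model, prompt=prompt, **kwargs
        )

    # Current behavior: wrap the prompt as a single user message and call chat/completions.
    return generator.client.chat.completions.create(
        model=generator.model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )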

I'll leave the rest to the Haystack team to see how they wish to proceed.

@vblagoje
Member

vblagoje commented Sep 4, 2024

cc @julian-risch to assign in the next sprint

@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Sep 7, 2024
@vblagoje
Member

@julian-risch I've read this issue report in detail and understand what @Permafacture is asking for, but in light of our plan to deprecate all generators I wonder how relevant work on this issue would be. I recommend closing with "Won't fix".

@julian-risch
Member

So far it's only an idea. We have not decided yet whether to change anything about the generators. I'll move the issue to "Hold" in the meantime.
