Running as a service #17
Comments
Yes, this PR seems to be what I was thinking about. The API I would expect is a llama.cpp-compatible one, as in the PR. The use case would be to have the choice between different inference servers in https://github.com/containers/podman-desktop-extension-ai-lab
Having an OpenAI-compliant (chat) REST API would be amazing. This would allow many tools (including LangChain4J) to integrate with Llama3.java without any extra code.
I personally think it makes more sense for this project to be usable as a library (which requires a clear API) that can then be embedded inside other libraries/frameworks to provide a REST API (compatibility with OpenAI makes 100% sense to me).
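To make the embedding idea concrete, here is a minimal sketch of an OpenAI-style /v1/chat/completions endpoint built on the JDK's own com.sun.net.httpserver. The `generate` hook, the port, and the response layout are all assumptions for illustration; a real integration would call into Llama3.java's inference loop and parse the incoming OpenAI request JSON instead of treating the body as a raw prompt.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class OpenAiCompatServer {

    // Hypothetical hook into the model; a real service would invoke
    // Llama3.java's generation here instead of returning a canned string.
    static String generate(String prompt) {
        return "Hello from the model";
    }

    // Minimal OpenAI-style chat completion response body (non-streaming).
    static String chatCompletionJson(String content) {
        return """
            {"object":"chat.completion",
             "choices":[{"index":0,
                         "message":{"role":"assistant","content":"%s"},
                         "finish_reason":"stop"}]}""".formatted(content);
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/v1/chat/completions", exchange -> {
            // For brevity the request body is used directly as the prompt;
            // a compliant server would parse the "messages" array from JSON.
            String prompt = new String(exchange.getRequestBody().readAllBytes(),
                                       StandardCharsets.UTF_8);
            byte[] body = chatCompletionJson(generate(prompt))
                              .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
    }
}
```

Because the wire format matches what OpenAI clients expect, tools like LangChain4J could point their base URL at this endpoint without extra glue code.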
Agreed, similar to what I've done as an experiment @ https://github.com/stephanj/Llama3JavaChatCompletionService
I have another question.
GGUF files come pre-quantized.
I see, thanks for the input! So I guess it makes sense to have the user choose which quantization they want?
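Since GGUF files come pre-quantized, choosing a quantization amounts to choosing a file: published checkpoints typically encode the quantization in the filename. A small sketch of that selection (the filenames and the suffix convention here are illustrative assumptions, not files the project ships):

```java
import java.util.List;
import java.util.Optional;

public class GgufPicker {

    // Pick the model file whose name carries the requested quantization tag
    // (e.g. "Q4_0", "Q8_0"). GGUF files are pre-quantized, so selecting a
    // quantization means selecting a download, not converting at load time.
    static Optional<String> pick(List<String> files, String quant) {
        return files.stream()
                    .filter(f -> f.contains(quant))
                    .findFirst();
    }
}
```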
Another question, if I may: say we obtain a list of request/response messages from chat history and want Llama3.java to be aware of those. What is the proper way to interact with Llama3.java in this case?
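One common approach (an assumption about usage in general, not a confirmed Llama3.java API) is to render the whole history into a single prompt with the model's chat template. A sketch using the Llama 3 template's special tokens:

```java
import java.util.List;

public class ChatTemplate {

    record Message(String role, String content) {}

    // Render a message history with the Llama 3 chat template so that prior
    // request/response pairs become part of the prompt the model ingests.
    static String render(List<Message> history) {
        StringBuilder sb = new StringBuilder("<|begin_of_text|>");
        for (Message m : history) {
            sb.append("<|start_header_id|>").append(m.role())
              .append("<|end_header_id|>\n\n")
              .append(m.content()).append("<|eot_id|>");
        }
        // Open an assistant header to cue the model's next turn.
        sb.append("<|start_header_id|>assistant<|end_header_id|>\n\n");
        return sb.toString();
    }
}
```

On every turn the client re-renders the full history and sends the result as the prompt; the trade-off of re-ingesting all those tokens is exactly what the next comment addresses.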
Yes, but ingesting all the tokens again and again is wasteful. Note that this is not a problem for cloud providers, because you pay per token and token ingestion is very fast (more so on GPUs); if they keep the KV caches around for a while, the savings are theirs. Also, #16 introduces prompt caching to disk.
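The reason caching works well for chat is that each new prompt extends the previous one, so most of the KV cache can be kept. A hedged sketch of the core prefix check (my own illustration of the idea, not code from the PR):

```java
import java.util.List;

public class PromptCache {

    // Length of the shared prefix between the cached token sequence and the
    // new prompt. Tokens up to this point keep their KV-cache entries; only
    // the remaining suffix needs to be re-ingested by the model.
    static int sharedPrefixLength(List<Integer> cached, List<Integer> prompt) {
        int n = Math.min(cached.size(), prompt.size());
        int i = 0;
        while (i < n && cached.get(i).equals(prompt.get(i))) i++;
        return i;
    }
}
```

In a chat session the new prompt is the old conversation plus one more exchange, so the shared prefix is nearly the whole prompt and only a handful of new tokens pay the ingestion cost.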
Oh, that's very interesting to know |
Thanks for this amazing work!
Would you be interested in having a --service mode, to be able to run llama3.java as a service and have a third-party chat client communicate with it?