
Running as a service #17

Open
feloy opened this issue Oct 10, 2024 · 11 comments

Comments

feloy commented Oct 10, 2024

Thanks for this amazing work!

Would you be interested in having a --service mode, so that llama3.java can run as a service and a third-party chat client can communicate with it?

mukel (Owner) commented Oct 10, 2024

There's #15 with a --server flag by @srogmann; it still needs some work, but the idea is similar.
What kind of API would you expose, and why? Please note that I'm a noob on this side of things.

feloy (Author) commented Oct 10, 2024

Yes, this PR seems to be what I was thinking about. The API I would expect is a llama.cpp-compatible one, as in the PR.

The use case would be to offer a choice between different inference servers in https://github.com/containers/podman-desktop-extension-ai-lab

stephanj commented

Having an OpenAI-compliant (chat) REST API would be amazing. This would allow many tools (including LangChain4J) to integrate with Llama3.java without any extra code.
See also https://platform.openai.com/docs/api-reference/chat
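For reference, here is a minimal sketch of the kind of client call such an endpoint would enable; the URL, port, and model name below are placeholders, since no such server exists yet:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatCompletionDemo {
    public static void main(String[] args) throws Exception {
        // Request body follows the OpenAI chat completions schema.
        String body = """
                {
                  "model": "llama3.2-3b-instruct",
                  "messages": [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Hello!"}
                  ]
                }""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The reply is in the response JSON under choices[0].message.content.
        System.out.println(response.body());
    }
}
```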

geoand commented Oct 22, 2024

I personally think it makes more sense for this project to be usable as a library (which requires making the API clear), which can then be embedded inside other libraries / frameworks to provide a REST API (compatibility with OpenAI makes 100% sense to me).
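Purely as an illustration (hypothetical types, not an actual Llama3.java interface), the library-facing surface could be as small as:

```java
import java.util.List;

// Hypothetical sketch of a minimal library-facing API; none of these
// types exist in Llama3.java today.
public interface InferenceEngine extends AutoCloseable {
    record ChatMessage(String role, String content) {}
    record SamplingOptions(float temperature, float topP, int maxTokens) {}

    // A framework (e.g. a REST layer) passes the conversation in and gets
    // the assistant's reply back; everything else stays internal.
    String chat(List<ChatMessage> history, SamplingOptions options);
}
```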

stephanj commented

Agreed, similar to what I've done as an experiment at https://github.com/stephanj/Llama3JavaChatCompletionService
But then better 😂

geoand commented Oct 23, 2024

I have another question.
Say someone has opted to use mukel/Llama-3.2-3B-Instruct-GGUF.
In that case, which quantization should be the default, or are users expected to provide that as well?

mukel (Owner) commented Oct 23, 2024

GGUF files come pre-quantized.
ollama has a notion of a "default" quantization that varies across models, e.g. some smaller models use Q8_0, while larger models default to Q4_K_M...
It is more complicated than that, because a model may state that it is quantized with Q8_0 yet contain tensors quantized with other methods, e.g. most Q4_0 quantizations on HuggingFace include some tensors quantized with Q6_K (see the initial implementation in #12).
IMHO, the "default" should be the smallest "acceptable" quantization for that model ... but then we have to define what "acceptable" means.
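To illustrate with a sketch (hypothetical names, not the actual types in the code): a loader cannot trust the file-level quantization alone and has to dispatch on each tensor's own type, roughly like:

```java
// Hypothetical names, for illustration only: each tensor in a GGUF file
// records its own ggml type, which may differ from the file-level one.
enum GgmlType { F32, Q4_0, Q6_K, Q8_0 }

record TensorInfo(String name, GgmlType type, byte[] raw) {}

final class TensorLoader {
    static float[] load(TensorInfo t) {
        return switch (t.type()) {
            case F32  -> toFloats(t.raw());   // stored unquantized
            case Q4_0 -> fromQ4_0(t.raw());   // 4-bit blocks + fp16 scale
            case Q6_K -> fromQ6_K(t.raw());   // often mixed into "Q4_0" files
            case Q8_0 -> fromQ8_0(t.raw());   // 8-bit blocks + fp16 scale
        };
    }
    // Stubs standing in for the real block-dequantization routines.
    private static float[] toFloats(byte[] raw) { return new float[0]; }
    private static float[] fromQ4_0(byte[] raw) { return new float[0]; }
    private static float[] fromQ6_K(byte[] raw) { return new float[0]; }
    private static float[] fromQ8_0(byte[] raw) { return new float[0]; }
}
```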

geoand commented Oct 23, 2024

I see, thanks for the input!

So I guess it makes sense to have the user choose which quantization they want?

geoand commented Oct 25, 2024

Another question if I may:

Say we obtain a list of request/response messages from chat history and want Llama3.java to be aware of them. What is the proper way to interact with Llama3.java in this case?
Should we use encodeDialogPrompt for this?
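For concreteness, something like the following is what I have in mind; I'm assuming the ChatFormat.Message / ChatFormat.Role types from the single-file source, so the exact signatures may differ:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: replaying prior turns through encodeDialogPrompt.
List<ChatFormat.Message> dialog = new ArrayList<>();
dialog.add(new ChatFormat.Message(ChatFormat.Role.SYSTEM, "You are a helpful assistant."));
dialog.add(new ChatFormat.Message(ChatFormat.Role.USER, "What is GGUF?"));
dialog.add(new ChatFormat.Message(ChatFormat.Role.ASSISTANT, "GGUF is a binary file format for ..."));
dialog.add(new ChatFormat.Message(ChatFormat.Role.USER, "And which quantizations can it hold?"));

// `true` appends the header for a fresh assistant turn, so generation
// continues the conversation from here.
List<Integer> promptTokens = chatFormat.encodeDialogPrompt(true, dialog);
```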

mukel (Owner) commented Oct 25, 2024

Yes, but ingesting all the tokens again and again is wasteful. Note that this is not a problem for cloud providers, because you pay per token and token ingestion is very fast (even more so on GPUs); if they keep the KV caches around for a bit, the savings are theirs.
I'd like to have transparent caching for prompts and conversations. When you create the model in e.g. LangChain4j, you could specify a caching strategy for prompts/conversations ... it's not clear to me what a good way (API) would be to specify what must be cached and how (persist to disk, keep KV caches in memory ...).

Also, #16 introduces prompt caching to disk.
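To make the strategy question above concrete, here is a purely hypothetical shape for such an API; nothing like this exists in the code today:

```java
// Purely hypothetical sketch of a pluggable caching policy.
interface CacheStrategy {
    enum Target { NONE, IN_MEMORY, ON_DISK }

    // Decide where to keep the KV cache for a given conversation.
    Target targetFor(String conversationId);

    // How long a cached prefix should be kept before eviction.
    long ttlMillis();
}
```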

geoand commented Oct 25, 2024

Oh, that's very interesting to know!
