
Running as a service #17

Open
feloy opened this issue Oct 10, 2024 · 11 comments

Comments

feloy commented Oct 10, 2024

Thanks for this amazing work!

Would you be interested in having a --service mode, so that llama3.java can run as a service and a third-party chat client can communicate with it?

mukel (Owner) commented Oct 10, 2024

There's #15 with a --server flag by @srogmann; it still needs some work, but the idea is similar.
What kind of API would you expose, and why? Please note that I'm a noob on this side of things.

feloy (Author) commented Oct 10, 2024

Yes, this PR seems to be what I was thinking about. The API I would expect is a llama.cpp-compatible one, as in the PR.

The use case would be to offer a choice between different inference servers in https://github.com/containers/podman-desktop-extension-ai-lab

stephanj commented

Having an OpenAI-compliant (chat) REST API would be amazing. This would allow many tools (including LangChain4J) to integrate with Llama3.java without any extra code.
See also https://platform.openai.com/docs/api-reference/chat
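For reference, here is a minimal sketch of the kind of client call such an endpoint would enable; the URL, port, and model name below are placeholders, since no such server exists yet:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatCompletionDemo {
    public static void main(String[] args) throws Exception {
        // Request body follows the OpenAI chat completions schema.
        String body = """
                {
                  "model": "llama3.2-3b-instruct",
                  "messages": [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Hello!"}
                  ]
                }""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The reply is in the response JSON under choices[0].message.content.
        System.out.println(response.body());
    }
}
```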

geoand commented Oct 22, 2024

I personally think it makes more sense for this project to be usable as a library (which requires making the API clear), which can then be embedded inside other libraries / frameworks to provide a REST API (compatibility with OpenAI makes 100% sense to me).
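Purely as an illustration (hypothetical types, not an actual Llama3.java interface), the library-facing surface could be as small as:

```java
import java.util.List;

// Hypothetical sketch of a minimal library-facing API; none of these
// types exist in Llama3.java today.
public interface InferenceEngine extends AutoCloseable {
    record ChatMessage(String role, String content) {}
    record SamplingOptions(float temperature, float topP, int maxTokens) {}

    // A framework (e.g. a REST layer) passes the conversation in and gets
    // the assistant's reply back; everything else stays internal.
    String chat(List<ChatMessage> history, SamplingOptions options);
}
```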

stephanj commented

Agreed, similar to what I've done as an experiment at https://github.com/stephanj/Llama3JavaChatCompletionService
But then better 😂

geoand commented Oct 23, 2024

I have another question.
Say someone has opted to use mukel/Llama-3.2-3B-Instruct-GGUF.
In that case, which quantization should be the default, or are users expected to provide that as well?

mukel (Owner) commented Oct 23, 2024

GGUF files come pre-quantized.
ollama has a notion of a "default" quantization that varies across models, e.g. some smaller models use Q8_0, while larger models default to Q4_K_M...
It is more complicated than that, because a model may state that it is quantized with Q8_0 yet contain tensors quantized with other methods, e.g. most Q4_0 quantizations on HuggingFace include some tensors quantized with Q6_K (see the initial implementation in #12).
IMHO, the "default" should be the smallest "acceptable" quantization for that model ... but then we have to define what "acceptable" means.
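To illustrate with a sketch (hypothetical names, not the actual types in the code): a loader cannot trust the file-level quantization alone and has to dispatch on each tensor's own type, roughly like:

```java
// Hypothetical names, for illustration only: each tensor in a GGUF file
// records its own ggml type, which may differ from the file-level one.
enum GgmlType { F32, Q4_0, Q6_K, Q8_0 }

record TensorInfo(String name, GgmlType type, byte[] raw) {}

final class TensorLoader {
    static float[] load(TensorInfo t) {
        return switch (t.type()) {
            case F32  -> toFloats(t.raw());   // stored unquantized
            case Q4_0 -> fromQ4_0(t.raw());   // 4-bit blocks + fp16 scale
            case Q6_K -> fromQ6_K(t.raw());   // often mixed into "Q4_0" files
            case Q8_0 -> fromQ8_0(t.raw());   // 8-bit blocks + fp16 scale
        };
    }
    // Stubs standing in for the real block-dequantization routines.
    private static float[] toFloats(byte[] raw) { return new float[0]; }
    private static float[] fromQ4_0(byte[] raw) { return new float[0]; }
    private static float[] fromQ6_K(byte[] raw) { return new float[0]; }
    private static float[] fromQ8_0(byte[] raw) { return new float[0]; }
}
```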

geoand commented Oct 23, 2024

I see, thanks for the input!

So I guess it makes sense to have the user choose which quantization they want?

geoand commented Oct 25, 2024

Another question if I may:

Say we obtain a list of request/response messages from chat history and want Llama3.java to be aware of them. What is the proper way to interact with Llama3.java in this case?
Should we use encodeDialogPrompt for this?
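For concreteness, something like the following is what I have in mind; I'm assuming the ChatFormat.Message / ChatFormat.Role types from the single-file source, so the exact signatures may differ:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: replaying prior turns through encodeDialogPrompt.
List<ChatFormat.Message> dialog = new ArrayList<>();
dialog.add(new ChatFormat.Message(ChatFormat.Role.SYSTEM, "You are a helpful assistant."));
dialog.add(new ChatFormat.Message(ChatFormat.Role.USER, "What is GGUF?"));
dialog.add(new ChatFormat.Message(ChatFormat.Role.ASSISTANT, "GGUF is a binary file format for ..."));
dialog.add(new ChatFormat.Message(ChatFormat.Role.USER, "And which quantizations can it hold?"));

// `true` appends the header for a fresh assistant turn, so generation
// continues the conversation from here.
List<Integer> promptTokens = chatFormat.encodeDialogPrompt(true, dialog);
```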

mukel (Owner) commented Oct 25, 2024

Yes, but ingesting all the tokens again and again is wasteful. Note that this is not a problem for cloud providers, because you pay per token and token ingestion is very fast (even more so on GPUs); if they keep the KV caches around for a bit, the savings are theirs.
I'd like to have transparent caching for prompts and conversations. When you create the model in e.g. LangChain4j, you could specify a caching strategy for prompts/conversations ... it's not clear to me what a good way (API) would be to specify what must be cached and how (persist to disk, keep KV caches in memory ...).

Also, #16 introduces prompt caching to disk.
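To make the strategy question above concrete, here is a purely hypothetical shape for such an API; nothing like this exists in the code today:

```java
// Purely hypothetical sketch of a pluggable caching policy.
interface CacheStrategy {
    enum Target { NONE, IN_MEMORY, ON_DISK }

    // Decide where to keep the KV cache for a given conversation.
    Target targetFor(String conversationId);

    // How long a cached prefix should be kept before eviction.
    long ttlMillis();
}
```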

geoand commented Oct 25, 2024

Oh, that's very interesting to know!
