Based on that issue I've compiled a list of features that are used in the server.

**Llama server features**

To see all supported options and server endpoints you can look at the server's documentation, which explains what the server supports and how to configure it for your use case.

**Feature list**

I'll compile a list of llama features that are used in the server and how far we've integrated them into our plugin. As a result we'll have a list of features to implement in order to be feature-comparable when we implement our own server. The features fall into four scopes:

- Model
- Instance specific
- Session specific
- Server specific

*(edited with prio notes)*
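To make the scopes concrete, here is a minimal sketch of how such a checklist could be modeled in code. Every name below is hypothetical; nothing here is part of the plugin, it only illustrates the four scopes:

```cpp
#include <string>
#include <vector>

// Hypothetical scopes mirroring the list above.
enum class FeatureScope {
    Model,    // a property of the model file itself
    Instance, // set per loaded instance (e.g. context size)
    Session,  // set per inference session (e.g. sampling params)
    Server,   // global server behavior (e.g. endpoints)
};

enum class IntegrationStatus { NotStarted, Partial, Done };

struct Feature {
    std::string name;
    FeatureScope scope;
    IntegrationStatus status; // how far the plugin has integrated it
};

// Example entries only; the real names (and their statuses) would come
// from the llama.cpp server documentation and the prio notes.
const std::vector<Feature> featureList = {
    {"context size", FeatureScope::Instance, IntegrationStatus::Done},
    {"sampling parameters", FeatureScope::Session, IntegrationStatus::Partial},
};
```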
-
OpenAI has models which can do embeddings and text completion. In our API this can be solved by creating a new Model for each operation. Maybe we solve this at the loader level, by creating different Models/Instances for each type of model that supports specific operations.
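A rough sketch of the loader-level variant, assuming hypothetical types and capability checks (none of this is the plugin's actual API):

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// One instance type per operation family (hypothetical).
struct EmbeddingsInstance { std::string model; };
struct CompletionInstance { std::string model; };

// Assumed capability queries; a real loader would read this from the
// model's metadata rather than returning a constant.
bool supportsEmbeddings(const std::string& /*model*/) { return true; }
bool supportsCompletion(const std::string& /*model*/) { return true; }

struct Loader {
    // The loader, not the Model, decides which operations are available
    // and hands out a dedicated instance per operation.
    std::unique_ptr<EmbeddingsInstance> loadForEmbeddings(const std::string& model) {
        if (!supportsEmbeddings(model))
            throw std::runtime_error("model does not support embeddings");
        return std::make_unique<EmbeddingsInstance>(EmbeddingsInstance{model});
    }
    std::unique_ptr<CompletionInstance> loadForCompletion(const std::string& model) {
        if (!supportsCompletion(model))
            throw std::runtime_error("model does not support text completion");
        return std::make_unique<CompletionInstance>(CompletionInstance{model});
    }
};
```

This would keep each instance type from exposing ops it can't serve; the capability check happens once, at load time.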
-
Currently we support a single rudimentary op: `run`, which generates up to `max_tokens` tokens. Obviously this is not enough for a chat (or at least would make a chat clunky).
What other ops do we need?
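For discussion, here is one possible shape of a chat-oriented op set. None of these ops exist yet; the names are placeholders:

```cpp
#include <string>
#include <vector>

struct ChatMessage {
    std::string role; // "system", "user" or "assistant"
    std::string text;
};

// Placeholder interface sketching candidate ops around the existing `run`.
struct ChatSession {
    virtual ~ChatSession() = default;

    // The op we have today: generate up to maxTokens tokens.
    virtual std::string run(int maxTokens) = 0;

    // Candidate additions for a less clunky chat:
    virtual void pushPrompt(const std::string& text) = 0; // feed input without generating
    virtual std::string nextToken() = 0;                  // streaming, one token at a time
    virtual void reset() = 0;                             // start a fresh conversation
    virtual std::vector<ChatMessage> history() const = 0; // inspect the context so far
};
```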