This is a wrapper of llama.cpp implemented as per the discussion *Integration of llama.cpp and whisper.cpp*:
- Use the llama.cpp C interface in llama.h
- Reimplement the common library
As mentioned in the discussion, the (maybe distant) future plan is to ditch llama.cpp by reimplementing it entirely with vanilla ggml and a C++ interface.
**Important:** When cloning this repo, don't forget to fetch the submodules.
- Either:
$ git clone https://github.com/alpaca-core/ac-local.git --recurse-submodules
- Or:
$ git clone https://github.com/alpaca-core/ac-local.git
$ cd ac-local
$ git submodule update --init --recursive
- Better error handling, please
- GGUF metadata access (`llama_model_meta_*`) is not great. We should provide a better interface (see the metadata sketch after this list)
- `llama_chat_apply_template` does not handle memory allocation optimally. There's a lot of room for improvement
- `llama_chat_format_single` doing a full chat format for a single message is terrible
- as a whole, chat management is not very efficient (see the chat formatting sketch after this list)
- Chat templates can't be used to escape special tokens. If the user actually enters some, this just messes up the resulting formatted text.
- Give vocab more visibility
    - Token-to-text can be handled much more elegantly by using plain ol' `string_view` instead of copying strings. It's not like tokens are going to be modified once the model is loaded
    - If we don't reimplement, perhaps keeping a parallel array of all tokens to string would be a good idea (see the token table sketch after this list)
- `llama_batch` being used for both input and output makes it hard to propagate the constness of the input buffer. This leads to code having to use non-const buffers, even if we know they're not going to be modified. We should bind the buffer constness to the batch struct itself (see the batch sketch after this list).
- The low-level llama context currently takes an rng seed (which is only used for mirostat sampling). A reimplemented context should be deterministic. If an operation requires random numbers, a generator should be provided from the outside (see the sampling sketch after this list).
    - For now we will hide the mirostat sampling altogether and ditch the seed
- As per this discussion, we should take into account how we want to deal with asset storage and whether we want to abstract the i/o away.
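
To make some of these notes more concrete, here are a few sketches. First, the GGUF metadata point: the kind of friendlier interface we have in mind would read all key/value pairs once into a map instead of juggling raw C buffers. The `llama_model_meta_*` signatures below are recalled from llama.h and may not match the exact version vendored here, so treat this as an illustration rather than drop-in code.

```cpp
#include <llama.h>

#include <cstdint>
#include <map>
#include <string>

// Read every GGUF key/value pair of a loaded model into a std::map.
// Assumes the llama_model_meta_* getters return the string length, or -1 on error.
std::map<std::string, std::string> readModelMeta(const llama_model* model) {
    auto readStr = [&](auto getter, int32_t i) -> std::string {
        std::string buf(128, '\0');
        int32_t n = getter(model, i, buf.data(), buf.size());
        if (n < 0) return {};
        if (size_t(n) >= buf.size()) {
            // the string didn't fit: grow the buffer and fetch it again
            buf.resize(n + 1);
            n = getter(model, i, buf.data(), buf.size());
            if (n < 0) return {};
        }
        buf.resize(n);
        return buf;
    };

    std::map<std::string, std::string> meta;
    const int32_t count = llama_model_meta_count(model);
    for (int32_t i = 0; i < count; ++i) {
        meta[readStr(llama_model_meta_key_by_index, i)] =
            readStr(llama_model_meta_val_str_by_index, i);
    }
    return meta;
}
```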
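
Next, the chat-management direction (not code against llama.h, just the shape of the idea): keep the chat and the length of what has already been formatted, so adding a message yields only the newly formatted delta instead of re-building and copying the whole transcript the way `llama_chat_format_single` effectively does. The `applyTemplate` callback stands in for whatever template engine ends up being used, and the assumption that chat templates are append-only is exactly that: an assumption.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

struct ChatMsg {
    std::string role;
    std::string text;
};

// Hypothetical incremental chat formatter.
// Assumes templates are append-only: formatting N+1 messages yields the
// formatted N-message transcript as a prefix.
class ChatFormatter {
public:
    using ApplyFn = std::function<std::string(const std::vector<ChatMsg>&)>;

    explicit ChatFormatter(ApplyFn applyTemplate) : m_apply(std::move(applyTemplate)) {}

    // Add a message and return only the text it contributes to the transcript.
    std::string addMessage(ChatMsg msg) {
        m_chat.push_back(std::move(msg));
        const std::string full = m_apply(m_chat); // still a full format...
        std::string delta = full.substr(m_formattedLength); // ...but only the delta is returned
        m_formattedLength = full.size();
        return delta;
    }

private:
    ApplyFn m_apply;
    std::vector<ChatMsg> m_chat;
    size_t m_formattedLength = 0; // how much of the transcript has been handed out already
};
```

A real reimplementation would ideally avoid even the full re-format, but that requires per-message knowledge of the template.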
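
For the vocab notes, the parallel-array idea could look roughly like this: copy each token's text exactly once when the model is loaded, then hand out non-owning `string_view`s. The `pieceOf` callback is a placeholder for the actual token-to-text conversion (e.g. a wrapper over llama.cpp's token-to-piece call); its signature is an assumption made for the example.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical token-to-text table, built once after the model is loaded.
// Lookups return views into stable storage: no allocation, no copying.
class TokenTextTable {
public:
    using PieceFn = std::function<std::string(int32_t token)>;

    TokenTextTable(int32_t vocabSize, const PieceFn& pieceOf) {
        m_storage.reserve(size_t(vocabSize));
        for (int32_t t = 0; t < vocabSize; ++t) {
            m_storage.push_back(pieceOf(t)); // the only copy, made at load time
        }
    }

    // Token-to-text lookup; the view is valid for as long as the table is alive.
    std::string_view operator[](int32_t token) const {
        return m_storage[size_t(token)];
    }

private:
    std::vector<std::string> m_storage; // owns the bytes; never modified after construction
};
```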
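
The `llama_batch` constness note, expressed as types: if we reimplement, the input and output sides could be separate structs, so const prompt data can be passed without casts. All names here are placeholders, not a proposed API.

```cpp
#include <cstdint>
#include <span>
#include <vector>

using Token = std::int32_t;

// Read-only view over the tokens fed to the model.
// Callers can pass const data directly; nothing forces a non-const buffer.
struct InputBatch {
    std::span<const Token> tokens;
};

// Mutable buffer that the decode step writes into.
struct OutputBatch {
    std::span<float> logits;
};

// The constness of the input is now visible (and enforced) in the signature.
// This stub just zeroes the logits; a real decode would run the model.
void decode(const InputBatch& in, OutputBatch& out) {
    (void)in;
    for (float& l : out.logits) l = 0.f;
}

int main() {
    const std::vector<Token> prompt = {1, 2, 3}; // const data stays const
    std::vector<float> logits(32000);
    OutputBatch out{logits};
    decode(InputBatch{prompt}, out);
}
```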
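
Finally, the seed/determinism note: whatever randomness is needed would be supplied by the caller, keeping the context itself deterministic. The sampler below only illustrates the shape of such an interface; it is not mirostat or any particular sampler from llama.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <span>
#include <vector>

using Token = std::int32_t;
using RngFn = std::function<float()>; // uniform float in [0, 1), supplied by the caller

// Pick a token from a normalized probability distribution.
// No rng state lives here; determinism is entirely in the caller's hands.
Token sampleToken(std::span<const float> probs, const RngFn& rng) {
    const float r = rng();
    float acc = 0.f;
    for (size_t i = 0; i < probs.size(); ++i) {
        acc += probs[i];
        if (r < acc) return Token(i);
    }
    return Token(probs.size() - 1); // guard against rounding
}

int main() {
    const std::vector<float> probs = {0.1f, 0.7f, 0.2f};

    std::mt19937 gen(42); // the seed lives with the caller, not with the context
    std::uniform_real_distribution<float> dist(0.f, 1.f);

    Token t = sampleToken(probs, [&] { return dist(gen); });
    (void)t;
}
```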