PyLLaMACpp

License: MIT PyPi version

Python bindings for llama.cpp

For those who don't know, llama.cpp is a port of Facebook's LLaMA model in pure C/C++:

  • Without dependencies
  • Apple silicon first-class citizen - optimized via ARM NEON
  • AVX2 support for x86 architectures
  • Mixed F16 / F32 precision
  • 4-bit quantization support
  • Runs on the CPU

Installation

  1. The easy way is to install the prebuilt wheels
pip install pyllamacpp

However, the llama.cpp build takes the architecture of the target CPU into account, so you might need to build the package from source:

pip install git+https://github.com/abdeladim-s/pyllamacpp.git
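
After either install command, you can confirm that the package is visible to your interpreter by querying Python's packaging metadata. This is a minimal sketch using only the standard library; `installed_version` is an illustrative helper name, not part of pyllamacpp.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pyllamacpp")
    print(found if found else "pyllamacpp is not installed in this environment")
```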

CLI

You can run the following simple command line interface to test the package once it is installed:

pyllamacpp path/to/ggml/model
pyllamacpp -h

usage: pyllamacpp [-h] [--n_ctx N_CTX] [--n_parts N_PARTS] [--seed SEED] [--f16_kv F16_KV] [--logits_all LOGITS_ALL]
                  [--vocab_only VOCAB_ONLY] [--use_mlock USE_MLOCK] [--embedding EMBEDDING] [--n_predict N_PREDICT] [--n_threads N_THREADS]
                  [--repeat_last_n REPEAT_LAST_N] [--top_k TOP_K] [--top_p TOP_P] [--temp TEMP] [--repeat_penalty REPEAT_PENALTY]
                  [--n_batch N_BATCH]
                  model

This is like a chatbot, You can start the conversation with `Hi, can you help me ?` Pay attention though that it may hallucinate!

positional arguments:
  model                 The path of the model file

options:
  -h, --help            show this help message and exit
  --n_ctx N_CTX         text context
  --n_parts N_PARTS
  --seed SEED           RNG seed
  --f16_kv F16_KV       use fp16 for KV cache
  --logits_all LOGITS_ALL
                        the llama_eval() call computes all logits, not just the last one
  --vocab_only VOCAB_ONLY
                        only load the vocabulary, no weights
  --use_mlock USE_MLOCK
                        force system to keep model in RAM
  --embedding EMBEDDING
                        embedding mode only
  --n_predict N_PREDICT
                        Number of tokens to predict
  --n_threads N_THREADS
                        Number of threads
  --repeat_last_n REPEAT_LAST_N
                        Last n tokens to penalize
  --top_k TOP_K         top_k
  --top_p TOP_P         top_p
  --temp TEMP           temp
  --repeat_penalty REPEAT_PENALTY
                        repeat_penalty
  --n_batch N_BATCH     batch size for prompt processing
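
To drive the CLI from a script, the options listed above can be assembled programmatically. The following is a minimal sketch, assuming the `pyllamacpp` executable is on your `PATH`; `build_command` is a hypothetical helper, and the option names come from the help output above.

```python
import shlex


def build_command(model_path, **options):
    """Build an argv list for the pyllamacpp CLI from keyword options."""
    argv = ["pyllamacpp"]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    argv.append(model_path)  # the positional `model` argument
    return argv


if __name__ == "__main__":
    cmd = build_command("path/to/ggml/model", n_predict=64, n_threads=4, temp=0.7)
    print(shlex.join(cmd))
    # To actually launch the interactive session (needs a real model file):
    # import subprocess; subprocess.run(cmd, check=True)
```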

Tutorial

Quick start

A simple Pythonic API is built on top of llama.cpp C/C++ functions. You can call it from Python as follows:

from pyllamacpp.model import Model

model = Model(model_path='./models/gpt4all-model.bin')
for token in model.generate("Tell me a joke ?"):
    print(token, end='', flush=True)
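
Because `generate` yields tokens one at a time in the example above, the full reply can also be gathered into a single string. A minimal sketch follows; `collect_text` is a hypothetical helper (not part of pyllamacpp) and assumes the tokens are plain strings, as the `print` call above suggests.

```python
def collect_text(tokens, echo=False):
    """Join streamed string tokens into one string, optionally echoing them."""
    parts = []
    for token in tokens:
        if echo:
            print(token, end="", flush=True)
        parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    # Stand-in token stream so the sketch runs without a model file.
    demo_tokens = iter(["Why", " did", " the", " chicken", "..."])
    print(collect_text(demo_tokens))
```

With a real model, pass `model.generate("Tell me a joke ?")` in place of the stand-in stream.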

Interactive Dialogue

You can set up an interactive dialogue by simply keeping the model variable alive:

from pyllamacpp.model import Model

model = Model(model_path='/path/to/ggml/model')
while True:
    try:
        prompt = input("You: ")
        if prompt == '':
            continue
        print("AI:", end='')
        for token in model.generate(prompt):
            print(f"{token}", end='', flush=True)
        print()
    except KeyboardInterrupt:
        break

Attribute a persona to the language model

The following example shows how to attribute a persona to the language model:

from pyllamacpp.model import Model

prompt_context = """Act as Bob. Bob is helpful, kind, honest,
and never fails to answer the User's requests immediately and with precision. 

User: Nice to meet you Bob!
Bob: Welcome! I'm here to assist you with anything you need. What can I do for you today?
"""

prompt_prefix = "\nUser:"
prompt_suffix = "\nBob:"

model = Model(model_path='/path/to/ggml/model',
              prompt_context=prompt_context,
              prompt_prefix=prompt_prefix,
              prompt_suffix=prompt_suffix)

while True:
    try:
        prompt = input("User: ")
        if prompt == '':
            continue
        print("Bob: ", end='')
        for token in model.generate(prompt, antiprompt='User:'):
            print(token, end='', flush=True)
        print()  # end the line once the full reply has been printed
    except KeyboardInterrupt:
        break
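
The `antiprompt` argument in the loop above presumably stops generation once the model starts writing the user's next turn. The idea can be illustrated in isolation. The sketch below is not pyllamacpp's implementation; `stop_at_antiprompt` is a hypothetical standalone filter over any stream of string tokens, and it holds back text that might be the beginning of the antiprompt so that no fragment of it leaks into the output.

```python
def stop_at_antiprompt(tokens, antiprompt):
    """Yield text from `tokens` until `antiprompt` appears, excluding it."""
    pending = ""  # text held back because it might be the start of the antiprompt
    for token in tokens:
        pending += token
        cut = pending.find(antiprompt)
        if cut != -1:
            if cut > 0:
                yield pending[:cut]
            return
        # Hold back the longest tail of `pending` that is a prefix of the antiprompt.
        keep = 0
        for size in range(min(len(antiprompt) - 1, len(pending)), 0, -1):
            if pending.endswith(antiprompt[:size]):
                keep = size
                break
        safe = pending[:len(pending) - keep]
        if safe:
            yield safe
        pending = pending[len(pending) - keep:]
    if pending:
        yield pending


if __name__ == "__main__":
    # Stand-in token stream; a real one would come from a model.
    fake_stream = ["Sure", ", happy", " to help.", "\n", "User", ":", " thanks"]
    print("".join(stop_at_antiprompt(fake_stream, "User:")))
```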

Supported models

Fully tested with the GPT4All model; see PyGPT4All.

All other models supported by llama.cpp should work as well.

Advanced usage

Advanced users can call the llama.cpp C API functions directly to implement their own logic. All functions from llama.h are exposed through the binding module _pyllamacpp.
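
One way to see what the binding module exposes is to inspect it at runtime. This is a minimal sketch under the assumption that `_pyllamacpp` is importable as a top-level module in your environment; `public_names` is a hypothetical helper that returns an empty list when the module cannot be imported.

```python
import importlib


def public_names(module_name):
    """Return the sorted public attribute names of a module, or [] if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return []
    return sorted(name for name in dir(module) if not name.startswith("_"))


if __name__ == "__main__":
    # Module name taken from the text above; its importability is an assumption.
    for name in public_names("_pyllamacpp"):
        print(name)
```

The printed names can then be compared against the declarations in llama.h to find the function you need.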

API reference

You can check the API reference documentation for more details.

FAQs

Discussions and contributions

If you find a bug, please open an issue.

If you have any feedback, or you want to share how you are using this project, feel free to use the Discussions and open a new topic.

License

This project is licensed under the same license as llama.cpp (MIT License).