Running Llama 2 with gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac).
- Supporting all Llama 2 models (7B, 13B, 70B, GPTQ, GGML) with 8-bit, 4-bit mode.
- Supporting GPU inference with at least 6 GB VRAM, and CPU inference.
- Supporting models: Llama-2-7b/13b/70b, all Llama-2-GPTQ, all Llama-2-GGML ...
- Supporting model backends:
  - Nvidia GPU: transformers, bitsandbytes (8-bit inference), AutoGPTQ (4-bit inference)
    - GPU inference with at least 6 GB VRAM
  - CPU, Mac/AMD GPU: llama.cpp
    - CPU inference demo on MacBook Air
- Web UI interface: gradio
Method 1: From PyPI
pip install llama2-wrapper
Method 2: From source
git clone https://github.com/liltom-eth/llama2-webui.git
cd llama2-webui
pip install -r requirements.txt
bitsandbytes >= 0.39 may not work on older NVIDIA GPUs. In that case, to use LOAD_IN_8BIT, you may have to downgrade like this:
pip install bitsandbytes==0.38.1
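Before downgrading, it can help to confirm which bitsandbytes version is actually installed. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and not part of this project, and it cannot tell whether your GPU counts as "older", only whether the installed version falls in the >= 0.39 range mentioned above.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a plain version string such as '0.38.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_at_least_0_39(version: str) -> bool:
    """True if the version is in the >= 0.39 range discussed above."""
    return parse_major_minor(version) >= (0, 39)


def installed_bitsandbytes_version() -> Optional[str]:
    """Look up the installed bitsandbytes version, or None if it is absent."""
    try:
        return metadata.version("bitsandbytes")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_bitsandbytes_version()
    if found is None:
        print("bitsandbytes is not installed")
    elif is_at_least_0_39(found):
        print(f"bitsandbytes {found}: may need a downgrade on older NVIDIA GPUs")
    else:
        print(f"bitsandbytes {found}: below 0.39")
```

The parser assumes simple dotted versions; pre-release strings such as `0.39rc1` would need a proper version parser.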
bitsandbytes also needs a special install for Windows:
pip uninstall bitsandbytes
pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.0-py3-none-win_amd64.whl
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
Llama-2-7b-Chat-GPTQ contains the GPTQ model files for Meta's Llama 2 7b Chat. GPTQ 4-bit Llama-2 models require less GPU VRAM to run.
Model Name | set MODEL_PATH in .env | Download URL |
---|---|---|
meta-llama/Llama-2-7b-chat-hf | /path-to/Llama-2-7b-chat-hf | Link |
meta-llama/Llama-2-13b-chat-hf | /path-to/Llama-2-13b-chat-hf | Link |
meta-llama/Llama-2-70b-chat-hf | /path-to/Llama-2-70b-chat-hf | Link |
meta-llama/Llama-2-7b-hf | /path-to/Llama-2-7b-hf | Link |
meta-llama/Llama-2-13b-hf | /path-to/Llama-2-13b-hf | Link |
meta-llama/Llama-2-70b-hf | /path-to/Llama-2-70b-hf | Link |
TheBloke/Llama-2-7b-Chat-GPTQ | /path-to/Llama-2-7b-Chat-GPTQ | Link |
TheBloke/Llama-2-7B-Chat-GGML | /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin | Link |
... | ... | ... |
Running the 4-bit model Llama-2-7b-Chat-GPTQ needs a GPU with 6 GB of VRAM.
Running the 4-bit model llama-2-7b-chat.ggmlv3.q4_0.bin needs a CPU with 6 GB of RAM. Other 2, 3, 4, 5, 6, and 8-bit GGML models are also available from TheBloke/Llama-2-7B-Chat-GGML.
These models can be downloaded from the links above with commands like:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf
To download Llama 2 models, you need to request access from https://ai.meta.com/llama/ and also enable access on repos like meta-llama/Llama-2-7b-chat-hf. Requests will be processed in hours.
For GPTQ models like TheBloke/Llama-2-7b-Chat-GPTQ and GGML models like TheBloke/Llama-2-7B-Chat-GGML, you can download directly without requesting access.
Set up your MODEL_PATH and model configs in the .env file.
There are some examples in the ./env_examples/ folder.
Model Setup | Example .env |
---|---|
Llama-2-7b-chat-hf 8-bit on GPU | .env.7b_8bit_example |
Llama-2-7b-Chat-GPTQ 4-bit on GPU | .env.7b_gptq_example |
Llama-2-7B-Chat-GGML 4-bit on CPU | .env.7b_ggmlv3_q4_0_example |
Llama-2-13b-chat-hf on GPU | .env.13b_example |
... | ... |
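To make the role of the .env file concrete, here is a minimal sketch of reading such settings with the standard library. Only the key names MODEL_PATH, LOAD_IN_8BIT, and LOAD_IN_4BIT come from this README; the file content, the `KEY = VALUE` syntax, and the helper functions are assumptions for illustration, and this is not necessarily how app.py loads its configuration.

```python
def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"true", "1", "yes"}


# Hypothetical .env content for an 8-bit GPU setup.
EXAMPLE_ENV = """
# 8-bit Llama-2-7b-chat-hf on GPU
MODEL_PATH = "/path-to/Llama-2-7b-chat-hf"
LOAD_IN_8BIT = True
LOAD_IN_4BIT = False
"""

config = parse_env_text(EXAMPLE_ENV)
model_path = config["MODEL_PATH"]
load_in_8bit = as_bool(config.get("LOAD_IN_8BIT", "False"))
load_in_4bit = as_bool(config.get("LOAD_IN_4BIT", "False"))
print(model_path, load_in_8bit, load_in_4bit)
```

In practice, copying the closest file from ./env_examples/ to .env and editing MODEL_PATH is the simpler route; the sketch only shows what the settings mean.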
Run chatbot with web UI:
python app.py
Without quantization, running requires around 14 GB of GPU VRAM for Llama-2-7b and 28 GB of GPU VRAM for Llama-2-13b.
If you are running on multiple GPUs, the model is loaded across them automatically and the VRAM usage is split. That allows you to run Llama-2-7b (which requires 14 GB of GPU VRAM) on a setup like 2 GPUs with 11 GB of VRAM each.
If you do not have enough memory, you can set LOAD_IN_8BIT to True in .env. This can reduce memory usage by around half with slightly degraded model quality. It is compatible with the CPU, GPU, and Metal backends.
Llama-2-7b with 8-bit compression can run on a single GPU with 8 GB of VRAM, like an Nvidia RTX 2080Ti, RTX 4080, T4, V100 (16GB).
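The memory figures quoted here follow roughly from the parameter count times the bytes stored per parameter. The sketch below is a back-of-the-envelope estimate for the weights alone, assuming 16-bit weights for the unquantized case and 1 GB = 1e9 bytes; the function name is hypothetical. Real usage is higher, presumably because of activations, caches, and runtime overhead, which is why the README's own numbers sit above these estimates.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts; 16-bit is the assumed unquantized precision.
for name, params in [("Llama-2-7b", 7e9), ("Llama-2-13b", 13e9)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params, bits):.1f} GB")
```

For Llama-2-7b this gives about 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which matches the "around half" reduction from 8-bit loading.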
If you want to run a 4-bit Llama-2 model like Llama-2-7b-Chat-GPTQ, you can set LOAD_IN_4BIT to True in .env, as in the example .env.7b_gptq_example.
Make sure you have downloaded the 4-bit model from Llama-2-7b-Chat-GPTQ and set the MODEL_PATH and arguments in the .env file.
Llama-2-7b-Chat-GPTQ can run on a single GPU with 6 GB of VRAM.
If you encounter an issue like NameError: name 'autogptq_cuda_256' is not defined, please refer to here.
Running a Llama-2 model on CPU requires the llama.cpp dependency and the llama.cpp Python bindings, which are already installed.
Download a GGML model like llama-2-7b-chat.ggmlv3.q4_0.bin following the Download Llama-2 Models section. The llama-2-7b-chat.ggmlv3.q4_0.bin model requires at least 6 GB of RAM to run on CPU.
Set up configs like .env.7b_ggmlv3_q4_0_example from env_examples as .env.
Run the web UI with python app.py.
If you would like to use Mac Metal for acceleration, reinstall llama-cpp-python with Metal enabled:
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
pip install 'llama-cpp-python[server]'
or check details:
If you would like to use AMD/Nvidia GPU for acceleration, check this:
Run benchmark script to compute performance on your device:
python benchmark.py
You can also select the number of times the benchmark will be run:
python benchmark.py --iter NB_OF_ITERATIONS
By default, the number of iterations is 5. You can lower it for a faster result or raise it for a more accurate one, but please only report results with at least 5 iterations.
benchmark.py loads the same .env as app.py.
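To illustrate why more iterations give a steadier number, here is a minimal sketch of averaging tokens per second over several runs. It is not the project's benchmark.py: the `generate` callable and the `fake_generate` stand-in are hypothetical, and a real measurement would call the loaded model instead.

```python
import time
from statistics import mean
from typing import Callable, List


def tokens_per_second(generate: Callable[[], int], iterations: int = 5) -> float:
    """Average tokens/sec over several runs.

    `generate` is a hypothetical callable that performs one generation
    and returns the number of tokens it produced.
    """
    speeds: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        num_tokens = generate()
        elapsed = time.perf_counter() - start
        speeds.append(num_tokens / elapsed)
    return mean(speeds)


def fake_generate() -> int:
    """Stand-in for a real model call: sleeps briefly, reports 20 tokens."""
    time.sleep(0.01)
    return 20


print(f"{tokens_per_second(fake_generate, iterations=5):.2f} tokens/sec")
```

Averaging per-run speeds smooths out one-off stalls, which is one reason to keep at least 5 iterations when reporting results.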
Some benchmark performance:
Model | Precision | Device | Memory | Speed (tokens/sec) | Load time (s) |
---|---|---|---|---|---|
Llama-2-7b-chat-hf | 8-bit | NVIDIA RTX 2080 Ti | 7.7 GB VRAM | 3.76 | 783.87 |
Llama-2-7b-Chat-GPTQ | 4-bit | NVIDIA RTX 2080 Ti | 5.8 GB VRAM | 12.08 | 192.91 |
llama-2-7b-chat.ggmlv3.q4_0 | 4-bit | Apple M2 CPU | 5.4 GB RAM | 5.28 | 0.20 |
llama-2-7b-chat.ggmlv3.q4_0 | 4-bit | Apple M2 Metal | 5.4 GB RAM | 9.56 | 0.47 |
llama-2-7b-chat.ggmlv3.q2_K | 2-bit | Intel i7-8700 | 4.5 GB RAM | 5.70 | 71.48 |
Check/contribute the performance of your device in the full performance doc.
MIT - see MIT License
This project enables users to adapt it freely for proprietary purposes without any restrictions.
Kindly read our Contributing Guide to learn and understand our development process.