A LeapfrogAI API-compatible vLLM wrapper for quantized and un-quantized model inferencing across GPU infrastructures.
Follow the instructions below to get the backend up and running, then use the LeapfrogAI API server to interact with it.
The instructions in this section assume the following:
- Python 3.11.x is properly installed and configured, including its development tools
- The LeapfrogAI API server is deployed and running
The following are additional assumptions for GPU inferencing:
- You have properly installed one or more NVIDIA GPUs and GPU drivers
- You have properly installed and configured the cuda-toolkit and nvidia-container-toolkit
The officially released images in this repository ship with a 4-bit GPTQ quantization of the Synthia-7B model as the default.
You can optionally specify different models or quantization types using the following Docker build arguments:
--build-arg HF_HUB_ENABLE_HF_TRANSFER="1"
: Enable or disable hf_transfer for faster Hugging Face Hub downloads (default: 1)
--build-arg REPO_ID="TheBloke/Synthia-7B-v2.0-GPTQ"
: Hugging Face repository ID for the model
--build-arg REVISION="gptq-4bit-32g-actorder_True"
: Revision or commit hash for the model
--build-arg QUANTIZATION="gptq"
: Quantization type (e.g., gptq, awq, or empty for un-quantized)
--build-arg TENSOR_PARALLEL_SIZE="1"
: The number of GPUs to spread the tensor processing across
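As an illustration of how these build arguments combine, here is a hedged sketch of a wrapper around docker build. The Dockerfile path (packages/vllm/Dockerfile), the image tag (vllm-custom:dev), and the function name build_vllm_image are assumptions made for this example, and the repository's own Makefile may pass additional arguments. The build-argument names and default values are the ones listed above.

```shell
# Hypothetical wrapper: assembles a docker build command from the documented build arguments.
# Set DOCKER=echo to preview the command without building anything.
build_vllm_image() {
  # QUANTIZATION uses ${VAR-default} (no colon) so that an explicitly empty
  # value is passed through, which selects an un-quantized model.
  "${DOCKER:-docker}" build \
    --build-arg HF_HUB_ENABLE_HF_TRANSFER="1" \
    --build-arg REPO_ID="${REPO_ID:-TheBloke/Synthia-7B-v2.0-GPTQ}" \
    --build-arg REVISION="${REVISION:-gptq-4bit-32g-actorder_True}" \
    --build-arg QUANTIZATION="${QUANTIZATION-gptq}" \
    --build-arg TENSOR_PARALLEL_SIZE="${TENSOR_PARALLEL_SIZE:-1}" \
    -f packages/vllm/Dockerfile \
    -t vllm-custom:dev \
    .
}
```

Run from the root of the repository, DOCKER=echo build_vllm_image prints the assembled command, and TENSOR_PARALLEL_SIZE=2 build_vllm_image would request tensor processing across two GPUs. When switching models, REPO_ID, REVISION, and QUANTIZATION need to agree with each other.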
To build and deploy just the vLLM Zarf package (from the root of the repository):
Deploy a UDS cluster if one isn't deployed already
pip install 'huggingface_hub[cli,hf_transfer]' # Used to download the model weights from Hugging Face
make build-vllm LOCAL_VERSION=dev
uds zarf package deploy packages/vllm/zarf-package-vllm-*-dev.tar.zst --confirm
To run the vLLM backend locally (starting from the root directory of the repository):
# Setup Virtual Environment if you haven't done so already
python -m venv .venv
source .venv/bin/activate
# Install dependencies
python -m pip install src/leapfrogai_sdk
cd packages/vllm
# To support Hugging Face Hub model downloads
python -m pip install ".[dev]"
# Copy the environment variable file, change this if different params are needed
cp .env.example .env
# Make sure environment variables are set
source .env
# Clone Model
# Supply a REPO_ID, FILENAME and REVISION if a different model is desired
python src/model_download.py
mv .model/*.gguf .model/model.gguf
# Start Model Backend
lfai-cli --app-dir=src/ main:Model
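One detail of the "source .env" step is worth a sketch. If the file contains plain KEY=VALUE lines without export, sourcing it sets the variables in the current shell only, and child processes such as the Python scripts will not see them. Whether .env.example already uses export is not stated here, so the helper below is an optional, hypothetical alternative (load_env is not part of the repository) that exports everything it reads.

```shell
# Hypothetical helper: source an env file and export every variable it assigns.
load_env() {
  env_file="${1:-.env}"
  # A bare file name would be looked up on PATH by POSIX '.', so anchor it to the current directory.
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  if [ ! -f "$env_file" ]; then
    echo "env file not found: $env_file" >&2
    return 1
  fi
  set -a          # auto-export every variable assigned from here on
  . "$env_file"
  set +a          # restore the default behavior
}
```

Calling load_env with no argument from packages/vllm reads the .env file in that directory. Subsequent commands started from the same shell then inherit the variables.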