You can use vLLM as an optimized worker implementation in FastChat. It offers advanced continuous batching and roughly 10x higher throughput than the default worker. See the vLLM documentation for the list of supported models.
- Install vLLM:

  ```
  pip install vllm
  ```
- When you launch a model worker, replace the normal worker (`fastchat.serve.model_worker`) with the vLLM worker (`fastchat.serve.vllm_worker`). All other commands, such as the controller, Gradio web server, and OpenAI API server, stay the same:

  ```
  python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3
  ```
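As a reference, the full serving stack is typically launched in separate terminals, with the vLLM worker swapped in for the normal model worker. A sketch, assuming default hosts and ports (adjust `--host`/`--port` for your setup):

```shell
# Terminal 1: the controller that registers and routes to workers
python3 -m fastchat.serve.controller

# Terminal 2: the vLLM worker (in place of fastchat.serve.model_worker)
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3

# Terminal 3 (optional): the Gradio web UI
python3 -m fastchat.serve.gradio_web_server

# Terminal 4 (optional): the OpenAI-compatible REST API
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```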
If you see tokenizer errors, try:

```
python3 -m fastchat.serve.vllm_worker --model-path lmsys/vicuna-7b-v1.3 --tokenizer hf-internal-testing/llama-tokenizer
```
If you use an AWQ quantized model, try:

```
python3 -m fastchat.serve.vllm_worker --model-path TheBloke/vicuna-7B-v1.5-AWQ --quantization awq
```
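Once the controller, vLLM worker, and OpenAI API server are running, any OpenAI-compatible client can query the model. A minimal sketch using only Python's standard library; the host, port, and model name below are assumptions that must match your `openai_api_server` flags and the worker's registered model:

```python
import json
import urllib.request

# Chat-completion payload for the OpenAI-compatible endpoint.
# "vicuna-7b-v1.3" is the assumed registered model name.
payload = {
    "model": "vicuna-7b-v1.3",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed default host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# To actually send the request (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```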