Fix docs, update command
Signed-off-by: Rafael Vasquez <[email protected]>
rafvasq committed Jan 9, 2025
1 parent 6f638e9 commit ce45c0d
Showing 43 changed files with 346 additions and 332 deletions.
1 change: 1 addition & 0 deletions docs/README.md
@@ -16,4 +16,5 @@ make html
```bash
python -m http.server -d build/html/
```

Launch your browser and open localhost:8000.
1 change: 0 additions & 1 deletion docs/source/api/multimodal/index.md
@@ -13,7 +13,6 @@ via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.

Looking to add your own multi-modal model? Please follow the instructions listed [here](#enabling-multimodal-inputs).
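
For readers unfamiliar with the `multi_modal_data` field mentioned above, here is a minimal illustrative sketch of attaching an image to a prompt; the model name and prompt template are assumptions, not taken from this page.

```python
# Illustrative sketch: passing an image alongside a prompt via multi_modal_data.
# The model name and chat template below are placeholders/assumptions.
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```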


## Module Contents

```{eval-rst}
1 change: 0 additions & 1 deletion docs/source/api/params.md
@@ -19,4 +19,3 @@ Optional parameters for vLLM APIs.
.. autoclass:: vllm.PoolingParams
:members:
```

2 changes: 2 additions & 0 deletions docs/source/community/sponsors.md
@@ -6,13 +6,15 @@ vLLM is a community project. Our compute resources for development and testing a
<!-- Note: Please keep these consistent with README.md. -->

Cash Donations:

- a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund

Compute Resources:

- AMD
- Anyscale
- AWS
2 changes: 0 additions & 2 deletions docs/source/contributing/overview.md
@@ -37,8 +37,6 @@ pytest tests/
Currently, the repository is not fully checked by `mypy`.
```

# Contribution Guidelines

## Issues

If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
4 changes: 2 additions & 2 deletions docs/source/deployment/docker.md
@@ -28,8 +28,8 @@ memory to share data between processes under the hood, particularly for tensor p
You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To build vLLM:

```console
$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
```
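
Once a container built from this image is running and serving the OpenAI-compatible API on port 8000 (an assumption, not shown in this excerpt), a quick smoke test from Python might look like the following sketch; the model name is a placeholder.

```python
# Minimal smoke test against a running vllm/vllm-openai container.
# Assumes the server is listening on localhost:8000 and serving the model below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model name
    prompt="Hello, vLLM!",
    max_tokens=16,
)
print(completion.choices[0].text)
```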

```{note}
10 changes: 5 additions & 5 deletions docs/source/deployment/frameworks/cerebrium.md
@@ -13,14 +13,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
To install the Cerebrium client, run:

```console
$ pip install cerebrium
$ cerebrium login
pip install cerebrium
cerebrium login
```

Next, to create your Cerebrium project, run:

```console
$ cerebrium init vllm-project
cerebrium init vllm-project
```

Next, to install the required packages, add the following to your cerebrium.toml:
@@ -58,10 +58,10 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
Then, run the following code to deploy it to the cloud:

```console
$ cerebrium deploy
cerebrium deploy
```

If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case` /run`)
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the URL with the function name you are calling (in our case `/run`)

```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
10 changes: 5 additions & 5 deletions docs/source/deployment/frameworks/dstack.md
@@ -13,16 +13,16 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
To install dstack client, run:

```console
$ pip install "dstack[all]
$ dstack server
pip install "dstack[all]
dstack server
```

Next, to configure your dstack project, run:

```console
$ mkdir -p vllm-dstack
$ cd vllm-dstack
$ dstack init
mkdir -p vllm-dstack
cd vllm-dstack
dstack init
```

Next, to provision a VM instance with an LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
2 changes: 1 addition & 1 deletion docs/source/deployment/frameworks/skypilot.md
@@ -338,7 +338,7 @@ run: |
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
```

2. Then, we can access the GUI at the returned gradio link:
1. Then, we can access the GUI at the returned gradio link:

```console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
2 changes: 1 addition & 1 deletion docs/source/deployment/integrations/llamastack.md
@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta
To install Llama Stack, run

```console
$ pip install llama-stack -q
pip install llama-stack -q
```

## Inference using OpenAI Compatible API
9 changes: 5 additions & 4 deletions docs/source/deployment/k8s.md
@@ -14,7 +14,7 @@ Before you begin, ensure that you have the following:

## Deployment Steps

1. **Create a PVC , Secret and Deployment for vLLM**
1. Create a PVC, Secret and Deployment for vLLM

The PVC is used to store the model cache and is optional; you can use hostPath or other storage options instead.

@@ -49,7 +49,7 @@ stringData:
Next, create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.

Here are two examples for using NVIDIA GPU and AMD GPU.

- NVIDIA GPU

@@ -194,9 +194,10 @@ spec:
- name: shm
mountPath: /dev/shm
```

You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.

2. **Create a Kubernetes Service for vLLM**
1. Create a Kubernetes Service for vLLM

Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:

@@ -219,7 +220,7 @@ spec:
type: ClusterIP
```

3. **Deploy and Test**
1. Deploy and Test

Apply the deployment and service configurations using `kubectl apply -f <filename>`:

11 changes: 4 additions & 7 deletions docs/source/design/automatic_prefix_caching.md
@@ -6,27 +6,24 @@ The core idea of [PagedAttention](#design-paged-attention) is to partition the K

To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.

```
```text
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```


In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:

```
```text
hash(prefix tokens + block tokens) <--> KV Block
```

With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
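
The following is a minimal, self-contained sketch of that hashing idea (it is not vLLM's actual implementation): each full block is keyed by the hash of its prefix tokens plus its own tokens, so identical prefixes resolve to the same key.

```python
# Toy illustration of the hash(prefix tokens + block tokens) <--> KV block mapping.
# This is not vLLM code; the block size and token IDs are arbitrary.
BLOCK_SIZE = 4

def block_keys(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[int]:
    keys = []
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        if len(block) < block_size:
            break  # partial blocks are not cached in this toy example
        prefix = tuple(token_ids[:start])
        keys.append(hash((prefix, tuple(block))))
    return keys

# Two requests that share their first block (same tokens, same empty prefix)
# map to the same key and can therefore share one physical KV block.
request_a = [1, 2, 3, 4, 5, 6, 7, 8]
request_b = [1, 2, 3, 4, 9, 10, 11, 12]
assert block_keys(request_a)[0] == block_keys(request_b)[0]
assert block_keys(request_a)[1] != block_keys(request_b)[1]
```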


This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed on their own, which enables us to manage the KV cache like ordinary caches in an operating system.


## Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
@@ -41,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra

However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:

- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
4 changes: 2 additions & 2 deletions docs/source/features/quantization/auto_awq.md
@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).

```console
$ pip install autoawq
pip install autoawq
```

After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:

```console
$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
```

AWQ models are also supported directly through the LLM entrypoint:
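
A minimal sketch of that usage (the prompt and sampling settings below are arbitrary choices, not taken from this page):

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint and request AWQ quantization explicitly.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["What is AWQ quantization?"], sampling_params)
print(outputs[0].outputs[0].text)
```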
7 changes: 4 additions & 3 deletions docs/source/features/quantization/bnb.md
@@ -9,15 +9,15 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
Below are the steps to utilize BitsAndBytes with vLLM.

```console
$ pip install bitsandbytes>=0.45.0
pip install bitsandbytes>=0.45.0
```

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
And usually, these repositories have a config.json file that includes a quantization_config section.

## Read quantized checkpoint.
## Read quantized checkpoint

```python
from vllm import LLM
@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
quantization="bitsandbytes", load_format="bitsandbytes")
```

## OpenAI Compatible Server

Append the following to your 4bit model arguments:

```
```console
--quantization bitsandbytes --load-format bitsandbytes
```
4 changes: 2 additions & 2 deletions docs/source/features/quantization/fp8.md
@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
$ pip install llmcompressor
pip install llmcompressor
```

## Quantization Process
@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR)
Install `vllm` and `lm-evaluation-harness`:

```console
$ pip install vllm lm-eval==0.4.4
pip install vllm lm-eval==0.4.4
```

Load and run the model in `vllm`:
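
A sketch of that step (the checkpoint path is a placeholder for the directory saved by the quantization step above):

```python
from vllm import LLM, SamplingParams

# Placeholder path: the directory written by the save step in the quantization process.
llm = LLM(model="./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.7, max_tokens=32))
print(outputs[0].outputs[0].text)
```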
2 changes: 1 addition & 1 deletion docs/source/features/quantization/fp8_e4m3_kvcache.md
@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
To install AMMO (AlgorithMic Model Optimization):

```console
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
```
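
As a sketch of how the resulting scaling factors are consumed offline, the snippet below reflects my understanding of the engine options at the time (the `kv_cache_dtype` value, the `quantization_param_path` argument, the model name, and the JSON path are assumptions and may differ from the collapsed text):

```python
from vllm import LLM, SamplingParams

# Assumed arguments: kv_cache_dtype selects an FP8 E4M3 KV cache, and
# quantization_param_path points at the scaling-factor JSON produced by the quantizer tool.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",          # placeholder model
    kv_cache_dtype="fp8_e4m3",
    quantization_param_path="/path/to/kv_cache_scales.json",  # placeholder path
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```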

Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
10 changes: 5 additions & 5 deletions docs/source/features/quantization/gguf.md
@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

```console
$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

```console
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```
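
The same GGUF file can also be used offline through the `LLM` entrypoint; a sketch (again passing the base model's tokenizer, as recommended above, with arbitrary prompt and sampling settings):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
outputs = llm.generate(["What is GGUF?"], SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```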

```{warning}
2 changes: 1 addition & 1 deletion docs/source/features/quantization/int8.md
@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
$ pip install llmcompressor
pip install llmcompressor
```

## Quantization Process
10 changes: 2 additions & 8 deletions docs/source/features/spec_decode.md
@@ -192,11 +192,11 @@ A few important things to consider when using the EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
used directly with vLLM due to differences in the expected layer names and model definition.
To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
to convert them. Note that this script does not modify the model's weights.

In the above example, use the script to first convert
the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
and then use the converted checkpoint as the draft model in vLLM.

2. The EAGLE based draft models need to be run without tensor parallelism
@@ -207,7 +207,6 @@ A few important things to consider when using the EAGLE based draft models:
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).


A variety of EAGLE draft models are available on the Hugging Face hub:

| Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
@@ -224,7 +223,6 @@ A variety of EAGLE draft models are available on the Hugging Face hub:
| Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
| Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |
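
Assuming a draft checkpoint converted with the script referenced in point 1 above (the paths and parameter values below are placeholders), wiring an EAGLE draft model into offline inference might look like this sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="/path/to/converted/EAGLE-LLaMA3-Instruct-8B",  # converted checkpoint (placeholder path)
    num_speculative_tokens=4,  # number of draft tokens proposed per step (example value)
)
outputs = llm.generate(["The capital of France is"], SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```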


## Lossless guarantees of Speculative Decoding

In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
@@ -250,17 +248,13 @@ speculative decoding, breaking down the guarantees into three key areas:
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

**Conclusion**

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:

- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

## Resources for vLLM contributors