Fix docs, update command
Signed-off-by: Rafael Vasquez <[email protected]>
rafvasq committed Jan 9, 2025
1 parent 6f638e9 commit ce45c0d
Showing 43 changed files with 346 additions and 332 deletions.
1 change: 1 addition & 0 deletions docs/README.md
@@ -16,4 +16,5 @@ make html
```bash
python -m http.server -d build/html/
```

Launch your browser and open localhost:8000.
1 change: 0 additions & 1 deletion docs/source/api/multimodal/index.md
@@ -13,7 +13,6 @@ via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.

Looking to add your own multi-modal model? Please follow the instructions listed [here](#enabling-multimodal-inputs).
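
For readers unfamiliar with the `multi_modal_data` field mentioned above, here is a minimal illustrative sketch of attaching an image to a prompt; the model name and prompt template are assumptions, not taken from this page.

```python
# Illustrative sketch: passing an image alongside a prompt via multi_modal_data.
# The model name and chat template below are placeholders/assumptions.
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```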


## Module Contents

```{eval-rst}
1 change: 0 additions & 1 deletion docs/source/api/params.md
@@ -19,4 +19,3 @@ Optional parameters for vLLM APIs.
.. autoclass:: vllm.PoolingParams
:members:
```

2 changes: 2 additions & 0 deletions docs/source/community/sponsors.md
@@ -6,13 +6,15 @@ vLLM is a community project. Our compute resources for development and testing a
<!-- Note: Please keep these consistent with README.md. -->

Cash Donations:

- a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund

Compute Resources:

- AMD
- Anyscale
- AWS
2 changes: 0 additions & 2 deletions docs/source/contributing/overview.md
@@ -37,8 +37,6 @@ pytest tests/
Currently, the repository is not fully checked by `mypy`.
```

# Contribution Guidelines

## Issues

If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
4 changes: 2 additions & 2 deletions docs/source/deployment/docker.md
@@ -28,8 +28,8 @@ memory to share data between processes under the hood, particularly for tensor p
You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To build vLLM:

```console
$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
```
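
Once a container built from this image is running and serving the OpenAI-compatible API on port 8000 (an assumption, not shown in this excerpt), a quick smoke test from Python might look like the following sketch; the model name is a placeholder.

```python
# Minimal smoke test against a running vllm/vllm-openai container.
# Assumes the server is listening on localhost:8000 and serving the model below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",  # placeholder model name
    prompt="Hello, vLLM!",
    max_tokens=16,
)
print(completion.choices[0].text)
```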

```{note}
10 changes: 5 additions & 5 deletions docs/source/deployment/frameworks/cerebrium.md
@@ -13,14 +13,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
To install the Cerebrium client, run:

```console
$ pip install cerebrium
$ cerebrium login
pip install cerebrium
cerebrium login
```

Next, to create your Cerebrium project, run:

```console
$ cerebrium init vllm-project
cerebrium init vllm-project
```

Next, to install the required packages, add the following to your cerebrium.toml:
@@ -58,10 +58,10 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
Then, run the following code to deploy it to the cloud:

```console
$ cerebrium deploy
cerebrium deploy
```

If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case` /run`)
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the URL with the function name you are calling (in our case `/run`)

```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
10 changes: 5 additions & 5 deletions docs/source/deployment/frameworks/dstack.md
@@ -13,16 +13,16 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
To install dstack client, run:

```console
$ pip install "dstack[all]
$ dstack server
pip install "dstack[all]
dstack server
```

Next, to configure your dstack project, run:

```console
$ mkdir -p vllm-dstack
$ cd vllm-dstack
$ dstack init
mkdir -p vllm-dstack
cd vllm-dstack
dstack init
```

Next, to provision a VM instance with an LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
2 changes: 1 addition & 1 deletion docs/source/deployment/frameworks/skypilot.md
@@ -338,7 +338,7 @@ run: |
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
```

2. Then, we can access the GUI at the returned gradio link:
1. Then, we can access the GUI at the returned gradio link:

```console
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live
2 changes: 1 addition & 1 deletion docs/source/deployment/integrations/llamastack.md
@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta
To install Llama Stack, run

```console
$ pip install llama-stack -q
pip install llama-stack -q
```

## Inference using OpenAI Compatible API
9 changes: 5 additions & 4 deletions docs/source/deployment/k8s.md
@@ -14,7 +14,7 @@ Before you begin, ensure that you have the following:

## Deployment Steps

1. **Create a PVC , Secret and Deployment for vLLM**
1. Create a PVC, Secret and Deployment for vLLM

The PVC is used to store the model cache and is optional; you can use hostPath or other storage options instead.

@@ -49,7 +49,7 @@ stringData:
Next, create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.

Here are two examples for using NVIDIA GPU and AMD GPU.

- NVIDIA GPU

@@ -194,9 +194,10 @@ spec:
- name: shm
mountPath: /dev/shm
```

You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.

2. **Create a Kubernetes Service for vLLM**
1. Create a Kubernetes Service for vLLM

Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:

@@ -219,7 +220,7 @@ spec:
type: ClusterIP
```

3. **Deploy and Test**
1. Deploy and Test

Apply the deployment and service configurations using `kubectl apply -f <filename>`:

11 changes: 4 additions & 7 deletions docs/source/design/automatic_prefix_caching.md
@@ -6,27 +6,24 @@ The core idea of [PagedAttention](#design-paged-attention) is to partition the K

To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.

```
```text
Block 1 Block 2 Block 3
[A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```


In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:

```
```text
hash(prefix tokens + block tokens) <--> KV Block
```

With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
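
The following is a minimal, self-contained sketch of that hashing idea (it is not vLLM's actual implementation): each full block is keyed by the hash of its prefix tokens plus its own tokens, so identical prefixes resolve to the same key.

```python
# Toy illustration of the hash(prefix tokens + block tokens) <--> KV block mapping.
# This is not vLLM code; the block size and token IDs are arbitrary.
BLOCK_SIZE = 4

def block_keys(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[int]:
    keys = []
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start:start + block_size]
        if len(block) < block_size:
            break  # partial blocks are not cached in this toy example
        prefix = tuple(token_ids[:start])
        keys.append(hash((prefix, tuple(block))))
    return keys

# Two requests that share their first block (same tokens, same empty prefix)
# map to the same key and can therefore share one physical KV block.
request_a = [1, 2, 3, 4, 5, 6, 7, 8]
request_b = [1, 2, 3, 4, 9, 10, 11, 12]
assert block_keys(request_a)[0] == block_keys(request_b)[0]
assert block_keys(request_a)[1] != block_keys(request_b)[1]
```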


This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed on their own, which enables us to manage the KV cache like ordinary caches in an operating system.


## Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
@@ -41,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra

However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:

- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
4 changes: 2 additions & 2 deletions docs/source/features/quantization/auto_awq.md
@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).

```console
$ pip install autoawq
pip install autoawq
```

After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:

```console
$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
```

AWQ models are also supported directly through the LLM entrypoint:
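
A minimal sketch of that usage (the prompt and sampling settings below are arbitrary choices, not taken from this page):

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint and request AWQ quantization explicitly.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["What is AWQ quantization?"], sampling_params)
print(outputs[0].outputs[0].text)
```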
7 changes: 4 additions & 3 deletions docs/source/features/quantization/bnb.md
@@ -9,15 +9,15 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
Below are the steps to utilize BitsAndBytes with vLLM.

```console
$ pip install bitsandbytes>=0.45.0
pip install bitsandbytes>=0.45.0
```

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
And usually, these repositories have a config.json file that includes a quantization_config section.

## Read quantized checkpoint.
## Read quantized checkpoint

```python
from vllm import LLM
@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
quantization="bitsandbytes", load_format="bitsandbytes")
```

## OpenAI Compatible Server

Append the following to your 4bit model arguments:

```
```console
--quantization bitsandbytes --load-format bitsandbytes
```
4 changes: 2 additions & 2 deletions docs/source/features/quantization/fp8.md
@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
$ pip install llmcompressor
pip install llmcompressor
```

## Quantization Process
@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR)
Install `vllm` and `lm-evaluation-harness`:

```console
$ pip install vllm lm-eval==0.4.4
pip install vllm lm-eval==0.4.4
```

Load and run the model in `vllm`:
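
A sketch of that step (the checkpoint path is a placeholder for the directory saved by the quantization step above):

```python
from vllm import LLM, SamplingParams

# Placeholder path: the directory written by the save step in the quantization process.
llm = LLM(model="./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.7, max_tokens=32))
print(outputs[0].outputs[0].text)
```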
2 changes: 1 addition & 1 deletion docs/source/features/quantization/fp8_e4m3_kvcache.md
@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
To install AMMO (AlgorithMic Model Optimization):

```console
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
```
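
As a sketch of how the resulting scaling factors are consumed offline, the snippet below reflects my understanding of the engine options at the time (the `kv_cache_dtype` value, the `quantization_param_path` argument, the model name, and the JSON path are assumptions and may differ from the collapsed text):

```python
from vllm import LLM, SamplingParams

# Assumed arguments: kv_cache_dtype selects an FP8 E4M3 KV cache, and
# quantization_param_path points at the scaling-factor JSON produced by the quantizer tool.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",          # placeholder model
    kv_cache_dtype="fp8_e4m3",
    quantization_param_path="/path/to/kv_cache_scales.json",  # placeholder path
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```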

Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
10 changes: 5 additions & 5 deletions docs/source/features/quantization/gguf.md
@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

```console
$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

```console
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```
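
The same GGUF file can also be used offline through the `LLM` entrypoint; a sketch (again passing the base model's tokenizer, as recommended above, with arbitrary prompt and sampling settings):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
outputs = llm.generate(["What is GGUF?"], SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```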

```{warning}
2 changes: 1 addition & 1 deletion docs/source/features/quantization/int8.md
@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
$ pip install llmcompressor
pip install llmcompressor
```

## Quantization Process
10 changes: 2 additions & 8 deletions docs/source/features/spec_decode.md
@@ -192,11 +192,11 @@ A few important things to consider when using the EAGLE based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
used directly with vLLM due to differences in the expected layer names and model definition.
To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
to convert them. Note that this script does not modify the model's weights.

In the above example, use the script to first convert
the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
and then use the converted checkpoint as the draft model in vLLM.

2. The EAGLE based draft models need to be run without tensor parallelism
@@ -207,7 +207,6 @@ A few important things to consider when using the EAGLE based draft models:
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).


A variety of EAGLE draft models are available on the Hugging Face hub:

| Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
@@ -224,7 +223,6 @@ A variety of EAGLE draft models are available on the Hugging Face hub:
| Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
| Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |
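
Assuming a draft checkpoint converted with the script referenced in point 1 above (the paths and parameter values below are placeholders), wiring an EAGLE draft model into offline inference might look like this sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="/path/to/converted/EAGLE-LLaMA3-Instruct-8B",  # converted checkpoint (placeholder path)
    num_speculative_tokens=4,  # number of draft tokens proposed per step (example value)
)
outputs = llm.generate(["The capital of France is"], SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```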


## Lossless guarantees of Speculative Decoding

In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
@@ -250,17 +248,13 @@ speculative decoding, breaking down the guarantees into three key areas:
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

**Conclusion**

While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to the following factors:

- **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

## Resources for vLLM contributors