[Doc] [1/N] Reorganize Getting Started section #11645

Open · wants to merge 4 commits into base: main
3 changes: 1 addition & 2 deletions docs/source/design/arch_overview.md
@@ -77,8 +77,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>

That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.

More details on the API server can be found in the {doc}`OpenAI Compatible
Server </serving/openai_compatible_server>` document.
More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document.

## LLM Engine

2 changes: 1 addition & 1 deletion docs/source/design/multiprocessing.md
@@ -2,7 +2,7 @@

## Debugging

Please see the [Debugging Tips](#debugging-python-multiprocessing)
Please see the [Troubleshooting](#troubleshooting-python-multiprocessing)
page for information on known issues and how to solve them.

## Introduction
File renamed without changes.
@@ -2,7 +2,7 @@

# Installation for ARM CPUs

vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the x86 platform documentation covering:
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the [x86 CPU documentation](#installation-x86) covering:

- CPU backend inference capabilities
- Relevant runtime environment variables
@@ -1,6 +1,6 @@
(installation-cpu)=
(installation-x86)=

# Installation with CPU
# Installation for x86 CPUs

vLLM initially supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16, and BF16. The vLLM CPU backend supports the following vLLM features:

@@ -151,4 +151,4 @@ $ python examples/offline_inference.py
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
```

- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
- Using Data Parallel for maximum throughput: launch an LLM serving endpoint on each NUMA node, along with one additional load balancer that dispatches requests to those endpoints. Common solutions such as [Nginx](#nginxloadbalancer) or HAProxy are recommended. The Anyscale Ray project provides this capability for LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is an example of setting up scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
@@ -1,6 +1,6 @@
(installation)=
(installation-cuda)=

# Installation
# Installation for CUDA

vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.

@@ -1,6 +1,6 @@
(installation-rocm)=

# Installation with ROCm
# Installation for ROCm

vLLM supports AMD GPUs with ROCm 6.2.

@@ -1,4 +1,6 @@
# Installation with Intel® Gaudi® AI Accelerators
(installation-gaudi)=

# Installation for Intel® Gaudi®

This README provides instructions on running vLLM with Intel Gaudi devices.

19 changes: 19 additions & 0 deletions docs/source/getting_started/installation/index.md
@@ -0,0 +1,19 @@
(installation-index)=

# Installation

vLLM supports the following hardware platforms:

```{toctree}
:maxdepth: 1

gpu-cuda
gpu-rocm
cpu-x86
cpu-arm
hpu-gaudi
tpu
xpu
openvino
neuron
```
@@ -1,6 +1,6 @@
(installation-neuron)=

# Installation with Neuron
# Installation for Neuron

vLLM 0.3.3 and later supports model inference and serving on AWS Trainium/Inferentia with the Neuron SDK, including continuous batching.
Paged Attention and Chunked Prefill are currently in development and will be available soon.
@@ -1,8 +1,8 @@
(installation-openvino)=

# Installation with OpenVINO
# Installation for OpenVINO

vLLM powered by OpenVINO supports all LLM models from {doc}`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features:
vLLM powered by OpenVINO supports all LLM models from the [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with at least AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). The OpenVINO vLLM backend supports the following advanced vLLM features:

- Prefix caching (`--enable-prefix-caching`)
- Chunked prefill (`--enable-chunked-prefill`)
@@ -1,6 +1,6 @@
(installation-tpu)=

# Installation with TPU
# Installation for TPUs

Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
@@ -1,6 +1,6 @@
(installation-xpu)=

# Installation with XPU
# Installation for XPUs

vLLM initially supports basic model inference and serving on the Intel GPU platform.

2 changes: 1 addition & 1 deletion docs/source/getting_started/quickstart.md
@@ -23,7 +23,7 @@ $ conda activate myenv
$ pip install vllm
```

Please refer to the {ref}`installation documentation <installation>` for more details on installing vLLM.
Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
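Once installation succeeds, a minimal first run might look like the following sketch, assuming the offline `vllm.LLM` API; the model name is only a placeholder.

```python
# Rough sketch only; the model below is a small placeholder.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```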

(offline-batched-inference)=

@@ -1,8 +1,8 @@
(debugging)=
(troubleshooting)=

# Debugging Tips
# Troubleshooting

This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

```{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
@@ -47,6 +47,7 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or pass `enforce_eager=True` to the {class}`~vllm.LLM` class, to disable the CUDAGraph optimization and isolate the failing operation.
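As a hedged illustration of the `enforce_eager=True` route (the model name here is just a placeholder):

```python
# Sketch: disable CUDAGraph capture so a failing CUDA op surfaces directly.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enforce_eager=True)  # placeholder model
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```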

(troubleshooting-incorrect-hardware-driver)=
## Incorrect hardware/driver

If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
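As a rough illustration of this kind of check (a sketch, not the script referred to above), a bare-bones NCCL all-reduce test launched with `torchrun` might look like this:

```python
# Illustrative sketch only, not the script referenced above.
# Single node:  torchrun --nproc-per-node=<#GPUs> test_comm.py
# Multi node:   also adjust --nnodes and --node-rank on each node.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# After an all-reduce of ones, every rank should hold the world size.
data = torch.ones(1, device="cuda")
dist.all_reduce(data)
torch.cuda.synchronize()
assert data.item() == dist.get_world_size(), "unexpected all-reduce result"
print(f"rank {dist.get_rank()}: GPU communication OK")
dist.destroy_process_group()
```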
@@ -139,7 +140,7 @@ A multi-node environment is more complicated than a single-node one. If you see
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
```

(debugging-python-multiprocessing)=
(troubleshooting-python-multiprocessing)=
## Python multiprocessing

### `RuntimeError` Exception
@@ -150,7 +151,7 @@ If you have seen a warning in your logs like this:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
for more information.
```
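A common remedy for the related `RuntimeError` (a hedged sketch; model and settings are placeholders) is to keep all vLLM usage under a main guard, so the script can be safely re-imported by the `spawn` start method:

```python
# Hedged sketch: with the `spawn` start method, the entry script is re-imported
# by child processes, so vLLM usage must live behind a main guard.
from vllm import LLM

def main():
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)  # placeholders
    print(llm.generate("Hello, my name is")[0].outputs[0].text)

if __name__ == "__main__":
    main()
```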

16 changes: 4 additions & 12 deletions docs/source/index.md
@@ -50,26 +50,19 @@ For more information, check out the following:
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- {ref}`vLLM Meetups <meetups>`.
- [vLLM Meetups](#meetups)

## Documentation

```{toctree}
:caption: Getting Started
:maxdepth: 1

getting_started/installation
getting_started/amd-installation
getting_started/openvino-installation
getting_started/cpu-installation
getting_started/gaudi-installation
getting_started/arm-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/xpu-installation
getting_started/installation/index
getting_started/quickstart
getting_started/debugging
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
```

```{toctree}
@@ -110,7 +103,6 @@ usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix
usage/performance
usage/faq
usage/engine_args
usage/env_vars
usage/usage_stats
2 changes: 1 addition & 1 deletion docs/source/models/generative_models.md
@@ -120,7 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)

## Online Inference

Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
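As a hedged sketch of how these endpoints are typically called (assuming a server already running locally on port 8000; the model name is a placeholder for whatever you serve):

```python
# Sketch: querying a running vLLM OpenAI-compatible server with the `openai` client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Chat API, the online counterpart of LLM.chat
chat = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder, use your served model
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(chat.choices[0].message.content)

# Completions API, the online counterpart of LLM.generate (text-only prompts)
completion = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="San Francisco is a",
)
print(completion.choices[0].text)
```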
2 changes: 1 addition & 1 deletion docs/source/models/pooling_models.md
@@ -106,7 +106,7 @@ A code example can be found here: <gh-file:examples/offline_inference_scoring.py

## Online Inference

Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
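As a hedged sketch (the embedding model name and port are placeholders; the server must be launched with a pooling/embedding model for this to work):

```python
# Sketch: Embeddings API as the online counterpart of LLM.embed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
result = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",  # placeholder embedding model
    input=["vLLM is a fast inference engine."],
)
print(len(result.data[0].embedding))  # dimensionality of the returned vector
```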
2 changes: 1 addition & 1 deletion docs/source/serving/distributed_serving.md
@@ -95,7 +95,7 @@ $ --tensor-parallel-size 16
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.

```{warning}
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](../getting_started/debugging.md) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
After you start the Ray cluster, you should also check the GPU-GPU communication between nodes, which can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for processes on the same node, not for processes on other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
```

```{warning}
4 changes: 2 additions & 2 deletions docs/source/usage/spec_decode.md
@@ -182,7 +182,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

**Conclusion**

@@ -195,7 +195,7 @@ can occur due to following factors:

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

## Resources for vLLM contributors

2 changes: 1 addition & 1 deletion docs/source/usage/structured_outputs.md
@@ -18,7 +18,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_whitespace_pattern`: used to override the default whitespace pattern for guided json decoding.
- `guided_decoding_backend`: used to select the guided decoding backend to use.

You can see the complete list of supported parameters on the [OpenAI Compatible Server](../serving/openai_compatible_server.md) page.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server) page.

Now let's see an example for each of the cases, starting with `guided_choice`, as it's the easiest one:
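The sketch below only illustrates the general `guided_choice` pattern, assuming the `extra_body` mechanism of the OpenAI client; it is not the official example, and the model, prompt, and choices are placeholders.

```python
# Illustrative sketch of guided_choice, not the official example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder
    messages=[{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```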

2 changes: 1 addition & 1 deletion vllm/utils.py
@@ -1938,7 +1938,7 @@ def _check_multiproc_method():
"the `spawn` multiprocessing start method. Setting "
"VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. "
"See https://docs.vllm.ai/en/latest/getting_started/"
"debugging.html#python-multiprocessing "
"troubleshooting.html#python-multiprocessing "
"for more information.")
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
