
Commit

Sync main
Signed-off-by: Jee Jee Li <[email protected]>
jeejeelee committed Dec 30, 2024
2 parents 789c888 + 0aa38d1 commit 2135d63
Showing 98 changed files with 4,132 additions and 2,514 deletions.
1 change: 1 addition & 0 deletions docs/requirements-docs.txt
@@ -19,3 +19,4 @@ openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entr
fastapi # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
requests
zmq
3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -191,6 +191,7 @@ def linkcode_resolve(domain, info):

# Mock out external dependencies here, otherwise the autodoc pages may be blank.
autodoc_mock_imports = [
"blake3",
"compressed_tensors",
"cpuinfo",
"cv2",
@@ -207,7 +208,7 @@ def linkcode_resolve(domain, info):
"tensorizer",
"pynvml",
"outlines",
"xgrammar,"
"xgrammar",
"librosa",
"soundfile",
"gguf",
6 changes: 3 additions & 3 deletions docs/source/contributing/dockerfile/dockerfile.md
@@ -11,11 +11,11 @@ Below is a visual representation of the multi-stage Dockerfile. The build graph

The edges of the build graph represent:

- FROM ... dependencies (with a solid line and a full arrow head)
- `FROM ...` dependencies (with a solid line and a full arrow head)

- COPY --from=... dependencies (with a dashed line and an empty arrow head)
- `COPY --from=...` dependencies (with a dashed line and an empty arrow head)

- RUN --mount=(.\*)from=... dependencies (with a dotted line and an empty diamond arrow head)
- `RUN --mount=(.*)from=...` dependencies (with a dotted line and an empty diamond arrow head)

> ```{figure} ../../assets/dev/dockerfile-stages-dependency.png
> :align: center
2 changes: 1 addition & 1 deletion docs/source/contributing/overview.md
@@ -34,7 +34,7 @@ pytest tests/
```

```{note}
Currently, the repository does not pass the `mypy` tests.
Currently, the repository is not fully checked by `mypy`.
```

# Contribution Guidelines
24 changes: 12 additions & 12 deletions docs/source/design/multimodal/multimodal_index.md
@@ -45,39 +45,39 @@ adding_multimodal_plugin
### Base Classes

```{eval-rst}
.. autodata:: vllm.multimodal.NestedTensors
.. automodule:: vllm.multimodal.base
:members:
:show-inheritance:
```

```{eval-rst}
.. autodata:: vllm.multimodal.BatchedTensorInputs
```
### Input Classes

```{eval-rst}
.. autoclass:: vllm.multimodal.MultiModalDataBuiltins
.. automodule:: vllm.multimodal.inputs
:members:
:show-inheritance:
```

```{eval-rst}
.. autodata:: vllm.multimodal.MultiModalDataDict
```
### Audio Classes

```{eval-rst}
.. autoclass:: vllm.multimodal.MultiModalKwargs
.. automodule:: vllm.multimodal.audio
:members:
:show-inheritance:
```

### Image Classes

```{eval-rst}
.. autoclass:: vllm.multimodal.MultiModalPlugin
.. automodule:: vllm.multimodal.image
:members:
:show-inheritance:
```

### Image Classes
### Video Classes

```{eval-rst}
.. automodule:: vllm.multimodal.image
.. automodule:: vllm.multimodal.video
:members:
:show-inheritance:
```
2 changes: 1 addition & 1 deletion docs/source/getting_started/arm-installation.md
@@ -20,7 +20,7 @@ Contents:
## Requirements

- **Operating System**: Linux or macOS
- **Compiler**: gcc/g++ >= 12.3.0 (optional, but recommended)
- **Compiler**: `gcc/g++ >= 12.3.0` (optional, but recommended)
- **Instruction Set Architecture (ISA)**: NEON support is required

(arm-backend-quick-start-dockerfile)=
4 changes: 2 additions & 2 deletions docs/source/getting_started/cpu-installation.md
@@ -24,7 +24,7 @@ Table of contents:
## Requirements

- OS: Linux
- Compiler: gcc/g++>=12.3.0 (optional, recommended)
- Compiler: `gcc/g++>=12.3.0` (optional, recommended)
- Instruction set architecture (ISA) requirement: AVX512 (optional, recommended)

(cpu-backend-quick-start-dockerfile)=
@@ -69,7 +69,7 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install

```{note}
- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable VLLM_CPU_AVX512BF16=1 before the building.
- If you want to force enable AVX512_BF16 for cross-compilation, please set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
```

(env-intro)=
2 changes: 1 addition & 1 deletion docs/source/getting_started/debugging.md
@@ -197,4 +197,4 @@ if __name__ == '__main__':
## Known Issues

- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
- To circumvent an NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234), all vLLM processes set the environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. This does not affect performance; it only gives memory benefits. When external processes want to set up an NCCL connection with vLLM's processes, they should also set this environment variable (as sketched below); otherwise, an inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656).
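
  As a rough illustration, an external process joining an NCCL group with vLLM workers could mirror the same setting before creating its process group. This is only a sketch; the rendezvous address, port, and ranks below are placeholders for your actual setup.

  ```python
  import os

  # Must match vLLM's setting, otherwise the NCCL environments are inconsistent
  # and initialization may hang or crash.
  os.environ["NCCL_CUMEM_ENABLE"] = "0"

  import torch.distributed as dist

  # Placeholder rendezvous parameters -- adjust to your deployment.
  dist.init_process_group(
      backend="nccl",
      init_method="tcp://127.0.0.1:29500",
      world_size=2,
      rank=1,
  )
  ```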
47 changes: 24 additions & 23 deletions docs/source/getting_started/gaudi-installation.md
@@ -141,32 +141,33 @@ Gaudi2 devices. Configurations that are not listed may or may not work.

Currently, vLLM for HPU supports four execution modes, depending on the selected HPU PyTorch Bridge backend (set via the `PT_HPU_LAZY_MODE` environment variable) and the `--enforce-eager` flag.

```{eval-rst}
.. list-table:: vLLM execution modes
:widths: 25 25 50
:header-rows: 1
* - ``PT_HPU_LAZY_MODE``
- ``enforce_eager``
- execution mode
* - 0
- 0
- torch.compile
* - 0
- 1
- PyTorch eager mode
* - 1
- 0
- HPU Graphs
* - 1
- 1
- PyTorch lazy mode
```{list-table} vLLM execution modes
:widths: 25 25 50
:header-rows: 1
* - `PT_HPU_LAZY_MODE`
- `enforce_eager`
- execution mode
* - 0
- 0
- torch.compile
* - 0
- 1
- PyTorch eager mode
* - 1
- 0
- HPU Graphs
* - 1
- 1
- PyTorch lazy mode
```

```{warning}
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in future releases. To obtain the best performance in 1.18.0, use HPU Graphs or PyTorch lazy mode.
```
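
For instance, HPU Graphs mode corresponds to `PT_HPU_LAZY_MODE=1` with eager enforcement disabled. A minimal, illustrative sketch of selecting that mode from Python, assuming the standard `vllm.LLM` entry point; the model name is a placeholder:

```python
import os

# Select HPU Graphs mode: lazy bridge backend, enforce_eager disabled.
# Set this before vLLM (and the HPU PyTorch bridge) is initialized.
os.environ["PT_HPU_LAZY_MODE"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model name -- substitute the model you actually serve.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=False)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```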

(gaudi-bucketing-mechanism)=

### Bucketing mechanism

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
@@ -185,7 +186,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, ma
INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
```

`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling - `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, the interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two until `step` is reached. We call this the ramp-up phase; it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
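
As an illustrative sketch (a hypothetical helper, not vLLM's actual implementation), boundaries under this min/step/max scheme could be generated like this:

```python
def generate_buckets(bmin: int, step: int, bmax: int) -> list[int]:
    """Ramp up by powers of two from `bmin` until `step` is reached,
    then walk in fixed `step` increments up to `bmax` (sketch only)."""
    buckets = []
    value = bmin
    # Ramp-up phase: multiply by 2 until the step size is reached.
    while value < step:
        buckets.append(value)
        value *= 2
    # Regular phase: fixed `step` increments up to `bmax`.
    value = step
    while value <= bmax:
        buckets.append(value)
        value += step
    return sorted(set(buckets))

# e.g. min=2, step=32, max=64 -> [2, 4, 8, 16, 32, 64]
print(generate_buckets(2, 32, 64))
```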

Example (with ramp-up)

@@ -214,7 +215,7 @@ If a request exceeds maximum bucket size in any dimension, it will be processed
As an example, if a request of 3 sequences, with a maximum sequence length of 412, comes in to an idle vLLM server, it will be padded and executed as the `(4, 512)` prefill bucket, since `batch_size` (number of sequences) will be padded to 4 (the closest batch-size dimension higher than 3) and the maximum sequence length will be padded to 512 (the closest sequence-length dimension higher than 412). After the prefill stage, it will be executed as the `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request finishing) - in which case it becomes a `(2, 512)` bucket - or the context length grows above 512 tokens, in which case it becomes the `(4, 640)` bucket.
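
A rough sketch of that round-up logic for the `(3, 412)` request above; `find_bucket` and the bucket lists are illustrative, not vLLM's actual code:

```python
def find_bucket(value: int, buckets: list[int]) -> int:
    """Return the smallest bucket boundary >= value (sketch only)."""
    return min(b for b in buckets if b >= value)

# Illustrative bucket boundaries for the example above.
batch_buckets = [1, 2, 4, 8, 16, 32, 64]
seq_buckets = [128, 256, 384, 512, 640, 768, 896, 1024]

# A request of 3 sequences with max sequence length 412 ...
batch_size, seq_len = 3, 412
# ... is padded up to the (4, 512) prefill bucket.
print(find_bucket(batch_size, batch_buckets), find_bucket(seq_len, seq_buckets))
```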

```{note}
Bucketing is transparent to a client - padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
Bucketing is transparent to the client -- padding in the sequence length dimension is never returned to the client, and padding in the batch dimension does not create new requests.
```

### Warmup
Expand All @@ -235,7 +236,7 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
```

This example uses the same buckets as in *Bucketing mechanism* section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to the execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later, skipping further graph compilations.
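
Conceptually, warmup just runs a padded dummy workload once per bucket so that compilation happens before real traffic arrives. A simplified sketch; `run_dummy_batch` is a placeholder for whatever executes a padded batch on your setup:

```python
def warmup(buckets, run_dummy_batch):
    """Execute a dummy batch for each (batch_size, seq_len) bucket once,
    so its compiled graph is cached before serving (sketch only)."""
    for i, (batch_size, seq_len) in enumerate(buckets, start=1):
        run_dummy_batch(batch_size, seq_len)  # first execution triggers compilation
        print(f"[Warmup][{i}/{len(buckets)}] batch_size:{batch_size} seq_len:{seq_len}")
```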

```{tip}
Compiling all the buckets might take some time and can be turned off with the `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do so, you may face graph compilations when executing a given bucket for the first time. Disabling warmup is fine for development, but it is highly recommended to enable it in deployment.
2 changes: 1 addition & 1 deletion docs/source/getting_started/neuron-installation.md
@@ -26,7 +26,7 @@ Installation steps:
(build-from-source-neuron)=

```{note}
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with vLLM >= 0.5.3. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
The currently supported version of PyTorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
```

## Build from source
4 changes: 2 additions & 2 deletions docs/source/getting_started/quickstart.md
@@ -114,7 +114,7 @@ $ "temperature": 0
$ }'
```

Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` python package:
Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application using the OpenAI API. For example, another way to query the server is via the `openai` Python package:

```python
from openai import OpenAI
@@ -151,7 +151,7 @@ $ ]
$ }'
```

Alternatively, you can use the `openai` python package:
Alternatively, you can use the `openai` Python package:

```python
from openai import OpenAI
55 changes: 27 additions & 28 deletions docs/source/getting_started/tpu-installation.md
@@ -68,33 +68,32 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--service-account SERVICE_ACCOUNT
```

```{eval-rst}
.. list-table:: Parameter descriptions
:header-rows: 1
* - Parameter name
- Description
* - QUEUED_RESOURCE_ID
- The user-assigned ID of the queued resource request.
* - TPU_NAME
- The user-assigned name of the TPU which is created when the queued
resource request is allocated.
* - PROJECT_ID
- Your Google Cloud project
* - ZONE
- The GCP zone where you want to create your Cloud TPU. The value you use
depends on the version of TPUs you are using. For more information, see
`TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
* - ACCELERATOR_TYPE
- The TPU version you want to use. Specify the TPU version, for example
`v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
* - RUNTIME_VERSION
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
* - SERVICE_ACCOUNT
- The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example:
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
```{list-table} Parameter descriptions
:header-rows: 1
* - Parameter name
- Description
* - QUEUED_RESOURCE_ID
- The user-assigned ID of the queued resource request.
* - TPU_NAME
- The user-assigned name of the TPU which is created when the queued
resource request is allocated.
* - PROJECT_ID
- Your Google Cloud project
* - ZONE
- The GCP zone where you want to create your Cloud TPU. The value you use
depends on the version of TPUs you are using. For more information, see
`TPU regions and zones <https://cloud.google.com/tpu/docs/regions-zones>`_
* - ACCELERATOR_TYPE
- The TPU version you want to use. Specify the TPU version, for example
`v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
* - RUNTIME_VERSION
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
* - SERVICE_ACCOUNT
- The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example:
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`
```

Connect to your TPU using SSH:
@@ -103,7 +102,7 @@ Connect to your TPU using SSH:
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
```

Install Miniconda
Install Miniconda:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh