[Doc] Improve GitHub links #11491

Merged: 3 commits, Dec 25, 2024
29 changes: 29 additions & 0 deletions docs/source/conf.py
@@ -74,6 +74,35 @@
html_static_path = ["_static"]
html_js_files = ["custom.js"]

myst_url_schemes = {
'http': None,
'https': None,
'mailto': None,
'ftp': None,
"gh-issue": {
"url":
"https://github.com/vllm-project/vllm/issues/{{path}}#{{fragment}}",
"title": "Issue #{{path}}",
"classes": ["github"],
},
"gh-pr": {
"url":
"https://github.com/vllm-project/vllm/pull/{{path}}#{{fragment}}",
"title": "Pull Request #{{path}}",
"classes": ["github"],
},
"gh-dir": {
"url": "https://github.com/vllm-project/vllm/tree/main/{{path}}",
"title": "{{path}}",
"classes": ["github"],
},
"gh-file": {
"url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
"title": "{{path}}",
"classes": ["github"],
},
}

# see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa
READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE')
if READTHEDOCS_VERSION_TYPE == "tag":
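
A rough sketch of how the new `gh-*` schemes above are expected to expand. This is a simplified stand-in written for illustration, not MyST's actual implementation: `SCHEME_TEMPLATES` and `expand_link` are hypothetical names, and the sketch assumes the `{{path}}` and `{{fragment}}` placeholders are filled by plain string substitution.

```python
# Simplified model of the URL templates added to conf.py in this PR.
# Only the "url" templates are copied; titles and classes are omitted.
SCHEME_TEMPLATES = {
    "gh-issue": "https://github.com/vllm-project/vllm/issues/{{path}}#{{fragment}}",
    "gh-pr": "https://github.com/vllm-project/vllm/pull/{{path}}#{{fragment}}",
    "gh-dir": "https://github.com/vllm-project/vllm/tree/main/{{path}}",
    "gh-file": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
}


def expand_link(link: str) -> str:
    """Expand a link such as 'gh-pr:8823' using the templates above.

    Hypothetical helper: assumes plain string substitution of the
    {{path}} and {{fragment}} placeholders.
    """
    scheme, _, rest = link.partition(":")
    path, _, fragment = rest.partition("#")
    template = SCHEME_TEMPLATES[scheme]
    return template.replace("{{path}}", path).replace("{{fragment}}", fragment)


pr_url = expand_link("gh-pr:8823")
issue_url = expand_link("gh-issue:5723#issuecomment-2554389656")
file_url = expand_link("gh-file:Dockerfile")
```

Under this model, `gh-pr:8823` ends in a bare `#` because the fragment is empty, and a fragment on a `gh-file:` link (such as `gh-file:SECURITY.md#reporting-a-vulnerability`) is dropped because that template has no `{{fragment}}` placeholder. Whether MyST behaves the same way is worth checking against a rendered build.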
4 changes: 2 additions & 2 deletions docs/source/contributing/dockerfile/dockerfile.md
@@ -1,7 +1,7 @@
# Dockerfile

See [here](https://github.com/vllm-project/vllm/blob/main/Dockerfile) for the main Dockerfile to construct
the image for running an OpenAI compatible server with vLLM. More information about deploying with Docker can be found [here](https://docs.vllm.ai/en/stable/serving/deploying_with_docker.html).
We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).

Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

14 changes: 7 additions & 7 deletions docs/source/contributing/overview.md
@@ -13,11 +13,12 @@ Finally, one of the most impactful ways to support us is by raising awareness ab

## License

See [LICENSE](https://github.com/vllm-project/vllm/tree/main/LICENSE).
See <gh-file:LICENSE>.

## Developing

Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Check out the [building from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source) documentation for details.
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
Check out the [building from source](#build-from-source) documentation for details.

## Testing

@@ -43,7 +44,7 @@ Currently, the repository does not pass the `mypy` tests.
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

```{important}
If you discover a security vulnerability, please follow the instructions [here](https://github.com/vllm-project/vllm/tree/main/SECURITY.md#reporting-a-vulnerability).
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
```

## Pull Requests & Code Reviews
@@ -54,9 +55,9 @@ code quality and improve the efficiency of the review process.

### DCO and Signed-off-by

When contributing changes to this project, you must agree to the [DCO](https://github.com/vllm-project/vllm/tree/main/DCO).
When contributing changes to this project, you must agree to the <gh-file:DCO>.
Commits must include a `Signed-off-by:` header which certifies agreement with
the terms of the [DCO](https://github.com/vllm-project/vllm/tree/main/DCO).
the terms of the DCO.

Using `-s` with `git commit` will automatically add this header.

@@ -89,8 +90,7 @@ If the PR spans more than one category, please include all relevant prefixes.
The PR needs to meet the following code quality standards:

- We adhere to [Google Python style guide](https://google.github.io/styleguide/pyguide.html) and [Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
- Pass all linter checks. Please use [format.sh](https://github.com/vllm-project/vllm/blob/main/format.sh) to format your
code.
- Pass all linter checks. Please use <gh-file:format.sh> to format your code.
- The code needs to be well-documented to ensure future contributors can easily
understand the code.
- Include sufficient tests to ensure the project stays correct and robust. This
8 changes: 4 additions & 4 deletions docs/source/contributing/profiling/profiling_index.md
@@ -22,13 +22,13 @@ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the serve
`export VLLM_RPC_TIMEOUT=1800000`
```

## Example commands and usage:
## Example commands and usage

### Offline Inference:
### Offline Inference

Refer to [examples/offline_inference_with_profiler.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_with_profiler.py) for an example.
Refer to <gh-file:examples/offline_inference_with_profiler.py> for an example.

### OpenAI Server:
### OpenAI Server

```bash
VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
17 changes: 6 additions & 11 deletions docs/source/design/arch_overview.md
@@ -55,7 +55,7 @@ for output in outputs:
More API details can be found in the {doc}`Offline Inference
</dev/offline_inference/offline_index>` section of the API docs.

The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py).
The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.

### OpenAI-compatible API server

@@ -66,7 +66,7 @@ This server can be started using the `vllm serve` command.
vllm serve <model>
```

The code for the `vllm` CLI can be found in [vllm/scripts.py](https://github.com/vllm-project/vllm/blob/main/vllm/scripts.py).
The code for the `vllm` CLI can be found in <gh-file:vllm/scripts.py>.

Sometimes you may see the API server entrypoint used directly instead of via the
`vllm` CLI command. For example:
@@ -75,7 +75,7 @@ Sometimes you may see the API server entrypoint used directly instead of via the
python -m vllm.entrypoints.openai.api_server --model <model>
```

That code can be found in [vllm/entrypoints/openai/api_server.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).
That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.

More details on the API server can be found in the {doc}`OpenAI Compatible
Server </serving/openai_compatible_server>` document.
@@ -105,7 +105,7 @@ processing.
- **Output Processing**: Processes the outputs generated by the model, decoding the
token IDs from a language model into human-readable text.

The code for `LLMEngine` can be found in [vllm/engine/llm_engine.py].
The code for `LLMEngine` can be found in <gh-file:vllm/engine/llm_engine.py>.

### AsyncLLMEngine

@@ -115,10 +115,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it
can handle multiple concurrent requests and stream outputs to clients.

The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
API server that serves as a simpler example in
[vllm/entrypoints/api_server.py].
API server that serves as a simpler example in <gh-file:vllm/entrypoints/api_server.py>.

The code for `AsyncLLMEngine` can be found in [vllm/engine/async_llm_engine.py].
The code for `AsyncLLMEngine` can be found in <gh-file:vllm/engine/async_llm_engine.py>.

## Worker

@@ -252,7 +251,3 @@ big problem.

In summary, the complete config object `VllmConfig` can be treated as an
engine-level global state that is shared among all vLLM classes.

[vllm/engine/async_llm_engine.py]: https://github.com/vllm-project/vllm/tree/main/vllm/engine/async_llm_engine.py
[vllm/engine/llm_engine.py]: https://github.com/vllm-project/vllm/tree/main/vllm/engine/llm_engine.py
[vllm/entrypoints/api_server.py]: https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/api_server.py
27 changes: 14 additions & 13 deletions docs/source/design/multiprocessing.md
@@ -2,13 +2,14 @@

## Debugging

Please see the [Debugging
Tips](https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing)
Please see the [Debugging Tips](#debugging-python-multiprocessing)
page for information on known issues and how to solve them.

## Introduction

*Note that source code references are to the state of the code at the time of writing in December, 2024.*
```{important}
The source code references are to the state of the code at the time of writing in December, 2024.
```

The use of Python multiprocessing in vLLM is complicated by:

@@ -20,7 +21,7 @@ This document describes how vLLM deals with these challenges.

## Multiprocessing Methods

[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) include:
[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html.md#contexts-and-start-methods) include:

- `spawn` - spawn a new Python process. This will be the default as of Python
3.14.
@@ -82,7 +83,7 @@ There are other miscellaneous places hard-coding the use of `spawn`:

Related PRs:

- <https://github.com/vllm-project/vllm/pull/8823>
- <gh-pr:8823>

## Prior State in v1

@@ -96,7 +97,7 @@ engine core.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L93-L95>
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L70-L77>
- https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/core_client.py#L44-L45
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/core_client.py#L44-L45>

It was off by default for all the reasons mentioned above - compatibility with
dependencies and code using vLLM as a library.
@@ -119,17 +120,17 @@ instruct users to either add a `__main__` guard or to disable multiprocessing.
If that known-failure case occurs, the user will see two messages that explain
what is happening. First, a log message from vLLM:

```
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
for more information.
```console
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
for more information.
```

Second, Python itself will raise an exception with a nice explanation:

```
```console
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
3 changes: 1 addition & 2 deletions docs/source/generate_examples.py
@@ -36,11 +36,10 @@ def generate_examples():

# Generate the example docs for each example script
for script_path, doc_path in zip(script_paths, doc_paths):
script_url = f"https://github.com/vllm-project/vllm/blob/main/examples/{script_path.name}"
# Make script_path relative to doc_path and call it include_path
include_path = '../../../..' / script_path.relative_to(root_dir)
content = (f"{generate_title(doc_path.stem)}\n\n"
f"Source: <{script_url}>.\n\n"
f"Source: <gh-file:examples/{script_path.name}>.\n\n"
f"```{{literalinclude}} {include_path}\n"
":language: python\n"
":linenos:\n```")
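
To make the effect of this hunk concrete, here is a small sketch of the string the loop builds for one script after the change. The `root_dir` value, the example path, the `build_example_doc` wrapper, and the `generate_title` stand-in are all assumptions for illustration; the real helper lives outside this hunk.

```python
from pathlib import Path

FENCE = "`" * 3  # three backticks, built indirectly to keep this sketch easy to embed


def generate_title(stem: str) -> str:
    # Hypothetical stand-in; the real generate_title is not shown in this diff.
    return "# " + stem.replace("_", " ").title()


def build_example_doc(script_path: Path, doc_path: Path, root_dir: Path) -> str:
    """Build the doc body for one example script, mirroring the new hunk."""
    # Make script_path relative to doc_path, as in the original loop.
    include_path = '../../../..' / script_path.relative_to(root_dir)
    return (f"{generate_title(doc_path.stem)}\n\n"
            f"Source: <gh-file:examples/{script_path.name}>.\n\n"
            f"{FENCE}{{literalinclude}} {include_path.as_posix()}\n"
            ":language: python\n"
            f":linenos:\n{FENCE}")


root = Path("/repo")  # hypothetical repository root
doc = build_example_doc(root / "examples" / "offline_inference.py",
                        Path("offline_inference.md"), root)
```

The `Source:` line now carries the short `gh-file:` form, so the full GitHub URL is resolved by the `myst_url_schemes` setting instead of being hard-coded in the generator.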
4 changes: 2 additions & 2 deletions docs/source/getting_started/amd-installation.md
@@ -22,7 +22,7 @@ Installation options:

You can build and install vLLM from source.

First, build a docker image from [Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm) and launch a docker container from the image.
First, build a docker image from <gh-file:Dockerfile.rocm> and launch a docker container from the image.
It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:

```console
@@ -33,7 +33,7 @@ It is important that the user kicks off the docker build using buildkit. Either
}
```

[Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm) uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches.
<gh-file:Dockerfile.rocm> uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches.
It provides flexibility to customize the build of docker image using the following arguments:

- `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image.
4 changes: 2 additions & 2 deletions docs/source/getting_started/cpu-installation.md
@@ -145,10 +145,10 @@ $ python examples/offline_inference.py

- On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel.

- Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](https://github.com/vllm-project/vllm/pull/6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
- Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:

```console
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
```

- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
7 changes: 4 additions & 3 deletions docs/source/getting_started/debugging.md
@@ -24,7 +24,7 @@ To isolate the model downloading and loading issue, you can use the `--load-form

## Model is too large

If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#distributed-inference-and-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using [this example](https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html) . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.

## Enable more logging

@@ -139,6 +139,7 @@ A multi-node environment is more complicated than a single-node one. If you see
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
```

(debugging-python-multiprocessing)=
## Python multiprocessing

### `RuntimeError` Exception
@@ -195,5 +196,5 @@ if __name__ == '__main__':

## Known Issues

- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](https://github.com/vllm-project/vllm/pull/6759).
- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](https://github.com/vllm-project/vllm/issues/5723#issuecomment-2554389656) .
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
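
For an external process that needs to set up an NCCL connection with vLLM's processes, a minimal sketch of matching the environment might look like this. The helper name `match_vllm_nccl_env` is hypothetical; the only detail taken from the text above is the variable `NCCL_CUMEM_ENABLE=0`, and the sketch assumes it has to be set before the process initializes NCCL.

```python
import os


def match_vllm_nccl_env() -> None:
    """Mirror the NCCL setting that vLLM processes apply to themselves.

    Hypothetical helper: call it before anything in this process
    initializes NCCL (assumed requirement).
    """
    os.environ["NCCL_CUMEM_ENABLE"] = "0"


match_vllm_nccl_env()
```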
6 changes: 2 additions & 4 deletions docs/source/getting_started/gaudi-installation.md
@@ -80,10 +80,8 @@ $ python setup.py develop

## Supported Features

- [Offline batched
inference](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference)
- Online inference via [OpenAI-Compatible
Server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server)
- [Offline batched inference](#offline-batched-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,