
Commit

[doc] format fix (vllm-project#10789)
Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: Andrew Feldman <[email protected]>
wangxiyuan authored and afeldman-nm committed Dec 2, 2024
1 parent 7831672 commit a877540
Showing 2 changed files with 19 additions and 19 deletions.
2 changes: 1 addition & 1 deletion docs/source/automatic_prefix_caching/details.md
@@ -25,7 +25,7 @@ With this mapping, we can add another indirection in vLLM’s KV cache managemen
This design achieves automatic prefix caching without needing to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed individually, which enables us to manage the KV cache like an ordinary cache in an operating system.


# Generalized Caching Policy
## Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
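
The hash-table idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not vLLM's actual implementation: the `BlockCache` class, its `lookup_or_allocate` method, the choice of (prefix tokens, block tokens) as the key, and the least-recently-used eviction rule are all hypothetical and chosen only to make the example concrete.

```python
from collections import OrderedDict
from typing import Sequence


class BlockCache:
    """Toy hash-table cache of KV blocks (hypothetical, for illustration only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Maps a block key to a block id; the dict order tracks recency of use.
        self._blocks = OrderedDict()
        self._next_id = 0

    @staticmethod
    def block_key(prefix: Sequence[int], block: Sequence[int]):
        # Assumption: a block is identified by the tokens before it
        # together with the tokens it contains.
        return (tuple(prefix), tuple(block))

    def lookup_or_allocate(self, prefix: Sequence[int], block: Sequence[int]):
        """Return (block_id, was_cached) for the given block."""
        key = self.block_key(prefix, block)
        if key in self._blocks:
            self._blocks.move_to_end(key)  # mark as most recently used
            return self._blocks[key], True
        if len(self._blocks) >= self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block
        block_id = self._next_id
        self._next_id += 1
        self._blocks[key] = block_id
        return block_id, False


cache = BlockCache(capacity=2)
system_prompt = [1, 2, 3, 4]
print(cache.lookup_or_allocate([], system_prompt))  # first request: cache miss
print(cache.lookup_or_allocate([], system_prompt))  # second request: cache hit
```

Because each block is looked up by its own key, a block can be reused or evicted without touching any other block, which is the property the paragraph above relies on.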

36 changes: 18 additions & 18 deletions docs/source/getting_started/gaudi-installation.rst
@@ -4,7 +4,7 @@ Installation with Intel® Gaudi® AI Accelerators
This README provides instructions on running vLLM with Intel Gaudi devices.

Requirements and Installation
=============================
-----------------------------

Please follow the instructions provided in the `Gaudi Installation
Guide <https://docs.habana.ai/en/latest/Installation_Guide/index.html>`__
@@ -13,7 +13,7 @@ please follow the methods outlined in the `Optimizing Training Platform
Guide <https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html>`__.

Requirements
------------
~~~~~~~~~~~~

- OS: Ubuntu 22.04 LTS
- Python: 3.10
@@ -22,7 +22,7 @@ Requirements


Quick start using Dockerfile
----------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: console

   $ docker build -f Dockerfile.hpu -t vllm-hpu-env .

@@ -34,10 +34,10 @@ Quick start using Dockerfile


Build from source
-----------------
~~~~~~~~~~~~~~~~~

Environment verification
~~~~~~~~~~~~~~~~~~~~~~~~
^^^^^^^^^^^^^^^^^^^^^^^^

To verify that the Intel Gaudi software was correctly installed, run:

@@ -53,7 +53,7 @@ Verification <https://docs.habana.ai/en/latest/Installation_Guide/SW_Verificatio
for more details.

Run Docker Image
~~~~~~~~~~~~~~~~
^^^^^^^^^^^^^^^^

It is highly recommended to use the latest Docker image from Intel Gaudi
vault. Refer to the `Intel Gaudi
@@ -68,7 +68,7 @@ Use the following commands to run a Docker image:

   $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

Build and Install vLLM
~~~~~~~~~~~~~~~~~~~~~~
^^^^^^^^^^^^^^^^^^^^^^

To build and install vLLM from source, run:

@@ -90,7 +90,7 @@ Currently, the latest features and performance optimizations are developed in Ga
Supported Features
==================
------------------

- `Offline batched
inference <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference>`__
@@ -107,15 +107,15 @@ Supported Features
- Attention with Linear Biases (ALiBi)

Unsupported Features
====================
--------------------

- Beam search
- LoRA adapters
- Quantization
- Prefill chunking (mixed-batch inferencing)

Supported Configurations
========================
------------------------

The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work.
@@ -152,10 +152,10 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling

Performance Tuning
==================
------------------

Execution modes
---------------
~~~~~~~~~~~~~~~

vLLM for HPU currently supports four execution modes, depending on the selected HPU PyTorch Bridge backend (set via the ``PT_HPU_LAZY_MODE`` environment variable) and the ``--enforce-eager`` flag.

@@ -184,7 +184,7 @@ Currently in vLLM for HPU we support four execution modes, depending on selected


Bucketing mechanism
-------------------
~~~~~~~~~~~~~~~~~~~

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. `Intel Gaudi Graph Compiler <https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime>`__ is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently this is achieved by "bucketing" the model's forward pass across two dimensions: ``batch_size`` and ``sequence_length``.
@@ -233,7 +233,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
Bucketing is transparent to the client: padding in the sequence length dimension is never returned to the client, and padding in the batch dimension does not create new requests.
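
The padding step can be illustrated with a short Python sketch. It is not the vLLM HPU implementation: the bucket boundaries in `BATCH_BUCKETS` and `SEQ_BUCKETS` and the helper names `round_up_to_bucket` and `pad_shape` are hypothetical, picked only to show how a request shape is rounded up to the nearest bucket in each dimension.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries, chosen only for this example.
BATCH_BUCKETS = [1, 2, 4, 8]
SEQ_BUCKETS = [128, 256, 384, 512]


def round_up_to_bucket(value: int, buckets: list[int]) -> int:
    """Return the smallest bucket >= value (buckets must be sorted ascending)."""
    idx = bisect_left(buckets, value)
    if idx == len(buckets):
        raise ValueError(f"{value} exceeds the largest bucket {buckets[-1]}")
    return buckets[idx]


def pad_shape(batch_size: int, seq_len: int) -> tuple[int, int]:
    """Map a request shape to the padded (batch_size, seq_len) bucket."""
    return (
        round_up_to_bucket(batch_size, BATCH_BUCKETS),
        round_up_to_bucket(seq_len, SEQ_BUCKETS),
    )


# A request of 3 sequences with a maximum sequence length of 412:
print(pad_shape(3, 412))  # (4, 512) with the hypothetical buckets above
```

With a fixed set of buckets, the number of distinct tensor shapes the graph compiler can encounter is bounded by the number of bucket combinations, which is what keeps recompilation under control.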

Warmup
------
~~~~~~

Warmup is an optional but highly recommended step that runs before the vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs so that no graph compilation overhead is incurred within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

@@ -257,7 +257,7 @@ This example uses the same buckets as in *Bucketing mechanism* section. Each out
Compiling all the buckets might take some time and can be turned off with the ``VLLM_SKIP_WARMUP=true`` environment variable. Keep in mind that if you do so, you may face graph compilations when a given bucket is executed for the first time. Disabling warmup is fine for development, but enabling it is highly recommended in deployment.

HPU Graph capture
-----------------
~~~~~~~~~~~~~~~~~

`HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.

@@ -321,7 +321,7 @@ Each described step is logged by vLLM server, as follows (negative values corres
Recommended vLLM Parameters
---------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- We recommend running inference on Gaudi 2 with ``block_size`` of 128
for BF16 data type. Using default values (16, 32) might lead to
@@ -333,7 +333,7 @@ Recommended vLLM Parameters
If you encounter out-of-memory issues, see the troubleshooting section.

Environment variables
---------------------
~~~~~~~~~~~~~~~~~~~~~

**Diagnostic and profiling knobs:**

@@ -380,7 +380,7 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM
- ``PT_HPU_ENABLE_LAZY_COLLECTIVES``: required to be ``true`` for tensor parallel inference with HPU Graphs
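
As a small sketch of how these variables might be set from Python before launching a vLLM process, the helper below builds an environment dictionary. The function name `hpu_env` and its parameters are hypothetical; only the variable names ``VLLM_SKIP_WARMUP`` and ``PT_HPU_ENABLE_LAZY_COLLECTIVES`` and the value ``true`` come from this document.

```python
import os


def hpu_env(skip_warmup: bool = False, lazy_collectives: bool = False) -> dict[str, str]:
    """Return a copy of the current environment with the selected knobs set.

    Hypothetical helper: only the variable names are taken from the text above.
    """
    env = dict(os.environ)
    if skip_warmup:
        # Skips bucket warmup; reasonable for development, not for deployment.
        env["VLLM_SKIP_WARMUP"] = "true"
    if lazy_collectives:
        # Required for tensor parallel inference with HPU Graphs.
        env["PT_HPU_ENABLE_LAZY_COLLECTIVES"] = "true"
    return env


env = hpu_env(lazy_collectives=True)
print(env["PT_HPU_ENABLE_LAZY_COLLECTIVES"])  # true
```

The resulting dictionary can be passed as the ``env`` argument of ``subprocess.run`` when starting the server, which leaves the parent process environment unchanged.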

Troubleshooting: Tweaking HPU Graphs
====================================
------------------------------------

If you experience device out-of-memory issues or want to attempt
inference at higher batch sizes, try tweaking HPU Graphs by following