Documentation for using EAGLE in vLLM
Signed-off-by: Sourashis Roy <[email protected]>
sroy745 committed Dec 22, 2024
1 parent aa24b0c commit 241f0f0
Showing 1 changed file with 52 additions and 0 deletions.
docs/source/usage/spec_decode.rst

@@ -161,6 +161,58 @@ A variety of speculative models of this type are available on HF hub:
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_

Speculating using EAGLE-based draft models
-------------------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
an `EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) <https://arxiv.org/pdf/2401.15077>`_ based draft model.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tensor_parallel_size=4,
        speculative_model="ibm-fms/llama3-70b-accelerator",
        speculative_draft_tensor_parallel_size=1,
    )

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

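At a high level, each decoding step proposes a block of draft tokens and then verifies them against the target model. The toy sketch below illustrates that propose-and-verify loop with lookup-table "models" (all names and token tables here are hypothetical illustrations, not vLLM's implementation):

```python
# Toy illustration of the propose-and-verify loop behind speculative decoding.
# Both "models" are lookup tables (hypothetical data), not real networks.
TARGET = {"the": "future", "future": "of", "of": "AI", "AI": "is", "is": "bright"}
DRAFT = dict(TARGET, AI="was")  # the cheap draft model disagrees after "AI"

def propose(prefix, k):
    """Draft model autoregressively proposes k candidate tokens."""
    out, last = [], prefix[-1]
    for _ in range(k):
        last = DRAFT.get(last, "<eos>")
        out.append(last)
    return out

def verify(prefix, proposal):
    """Greedy verification: accept the longest matching prefix of the
    proposal, then append the target model's own next token."""
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = TARGET.get(ctx[-1], "<eos>")
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(TARGET.get(ctx[-1], "<eos>"))  # "bonus" token on full accept
    return accepted

tokens = ["the"]
while tokens[-1] != "<eos>":
    tokens += verify(tokens, propose(tokens, k=4))
tokens = tokens[:tokens.index("<eos>")]  # trim end-of-sequence padding
print(tokens)  # → ['the', 'future', 'of', 'AI', 'is', 'bright']
```

In real EAGLE decoding the draft predicts at the feature level using the target model's hidden states, and verification uses rejection sampling rather than the greedy check above; the control flow is the same, which is why a mismatch (after ``AI`` here) costs only the rejected suffix of the proposal.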
A few important things to consider when using EAGLE-based draft models:

1. The EAGLE-based draft models currently need to be run without tensor parallelism, although
   it is possible to run the main model using tensor parallelism (see the example above). Since the
   speculative models are relatively small, we still see significant speedups. However, this
   limitation will be fixed in a future release.

2. The EAGLE draft models available in this Hugging Face repository cannot be used directly
   with vLLM due to differences in the expected layer names and model definition. To use these
   models with vLLM, convert them first with the provided script. Note that this script does not
   modify the model's weights.
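Conceptually, such a conversion is just a renaming of checkpoint keys. Below is a minimal sketch of the idea, assuming a hypothetical ``model.``-prefix convention (the actual mapping vLLM expects may differ; use the referenced script for real checkpoints):

```python
def remap_eagle_keys(state_dict):
    """Rename draft-checkpoint keys into the layout a loader might expect.
    The 'model.' prefix rule below is a hypothetical example, not vLLM's
    actual mapping; the weights themselves are copied unchanged."""
    remapped = {}
    for name, weight in state_dict.items():
        if name.startswith("lm_head."):
            remapped[name] = weight  # head keys kept as-is
        else:
            remapped["model." + name] = weight
    return remapped

# Toy "checkpoint": weights are plain lists so the sketch stays dependency-free.
ckpt = {"layers.0.self_attn.q_proj.weight": [0.1], "lm_head.weight": [0.2]}
print(sorted(remap_eagle_keys(ckpt)))
# → ['lm_head.weight', 'model.layers.0.self_attn.q_proj.weight']
```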

3. When using EAGLE-based draft models with vLLM, the observed speedup is lower than expected
   for EAGLE-based speculative decoding. This issue is under investigation and is tracked in
   `vllm-project/vllm#9565 <https://github.com/vllm-project/vllm/issues/9565>`_.
   Known differences between the vLLM implementation of EAGLE-based speculation and the original
   EAGLE implementation include:

a. ......
b. .....

A variety of EAGLE draft models are available on HF hub:


Lossless guarantees of Speculative Decoding
-------------------------------------------
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of