From 241f0f0b315d2b0b36f537a92f314369c561615b Mon Sep 17 00:00:00 2001
From: Sourashis Roy
Date: Sun, 22 Dec 2024 18:40:10 +0000
Subject: [PATCH] Documentation for using EAGLE in vLLM

Signed-off-by: Sourashis Roy
---
 docs/source/usage/spec_decode.rst | 52 +++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/docs/source/usage/spec_decode.rst b/docs/source/usage/spec_decode.rst
index f1f1917f974bb..1bf1773793bfe 100644
--- a/docs/source/usage/spec_decode.rst
+++ b/docs/source/usage/spec_decode.rst
@@ -161,6 +161,58 @@ A variety of speculative models of this type are available on HF hub:

* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_

Speculating using EAGLE based draft models
------------------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
an `EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) <https://arxiv.org/abs/2401.15077>`_ based draft model.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tensor_parallel_size=4,
        # An EAGLE checkpoint converted for vLLM; see point 2 below.
        speculative_model="path/to/modified/eagle/model",
        speculative_draft_tensor_parallel_size=1,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

A few important things to consider when using EAGLE based draft models:

1. The EAGLE based draft models currently need to be run without tensor parallelism
   (``speculative_draft_tensor_parallel_size=1``), although it is possible to run the main
   model with tensor parallelism (see the example above). Since the speculative models are
   relatively small, we still see significant speedups. However, this limitation will be
   fixed in a future release.

2. The EAGLE draft models available on the Hugging Face hub cannot be used directly with
   vLLM because their layer names and model definition differ from what vLLM expects. To use
   these models with vLLM, convert them first with the provided script; a sketch of the idea
   appears at the end of this patch. Note that the script does not modify the model's
   weights.

3. When using EAGLE based speculators with vLLM, the observed speedup is lower than the
   speedup reported by the reference EAGLE implementation. This issue is under investigation
   and is tracked at https://github.com/vllm-project/vllm/issues/9565. Known differences
   between the vLLM implementation of EAGLE based speculation and the original EAGLE
   implementation include:

   a. ...
   b. ...

A variety of EAGLE draft models are available on HF hub:

Lossless guarantees of Speculative Decoding
-------------------------------------------
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
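
As a concrete illustration of point 2 above, the sketch below shows one way a layer-name
conversion for an EAGLE draft checkpoint could work. This is a minimal, hypothetical example,
not the actual script referenced in that point: the ``model.`` prefix rule, the file names,
and the use of a plain PyTorch state dict are all assumptions made for illustration. In line
with point 2, only the key names change; the tensors themselves are passed through untouched.

.. code-block:: python

    # Hypothetical converter for an EAGLE draft checkpoint. The renaming rule
    # below ("layers.*" -> "model.layers.*") is an assumption for illustration;
    # the names vLLM actually expects come from its EAGLE model definition.
    import torch

    def convert_eagle_checkpoint(src_path: str, dst_path: str) -> None:
        # Load the original draft-model state dict onto the CPU.
        state_dict = torch.load(src_path, map_location="cpu")

        converted = {}
        for name, tensor in state_dict.items():
            # Rename keys only; weights are copied through unchanged.
            new_name = f"model.{name}" if name.startswith("layers.") else name
            converted[new_name] = tensor

        torch.save(converted, dst_path)

    if __name__ == "__main__":
        # Hypothetical input/output file names.
        convert_eagle_checkpoint("eagle_original.bin", "eagle_for_vllm.bin")

The resulting directory (converted weights plus the original config and tokenizer files) is
what would be passed as ``speculative_model`` in the example above.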