Documentation for using EAGLE in vLLM
Signed-off-by: Sourashis Roy <[email protected]>
sroy745 committed Dec 22, 2024
1 parent aa24b0c commit 241f0f0
Showing 1 changed file with 52 additions and 0 deletions.
docs/source/usage/spec_decode.rst

@@ -161,6 +161,58 @@ A variety of speculative models of this type are available on HF hub:
* `granite-7b-instruct-accelerator <https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator>`_
* `granite-20b-code-instruct-accelerator <https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator>`_

Speculating using EAGLE-based draft models
-------------------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
an `EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) <https://arxiv.org/pdf/2401.15077>`_ based draft model.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tensor_parallel_size=4,
        speculative_model="ibm-fms/llama3-70b-accelerator",
        speculative_draft_tensor_parallel_size=1,
    )

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

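At a high level, each decoding step proposes a block of draft tokens and then verifies them against the target model. The toy sketch below illustrates that propose-and-verify loop with lookup-table "models" (all names and token tables here are hypothetical illustrations, not vLLM's implementation):

```python
# Toy illustration of the propose-and-verify loop behind speculative decoding.
# Both "models" are lookup tables (hypothetical data), not real networks.
TARGET = {"the": "future", "future": "of", "of": "AI", "AI": "is", "is": "bright"}
DRAFT = dict(TARGET, AI="was")  # the cheap draft model disagrees after "AI"

def propose(prefix, k):
    """Draft model autoregressively proposes k candidate tokens."""
    out, last = [], prefix[-1]
    for _ in range(k):
        last = DRAFT.get(last, "<eos>")
        out.append(last)
    return out

def verify(prefix, proposal):
    """Greedy verification: accept the longest matching prefix of the
    proposal, then append the target model's own next token."""
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = TARGET.get(ctx[-1], "<eos>")
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(TARGET.get(ctx[-1], "<eos>"))  # "bonus" token on full accept
    return accepted

tokens = ["the"]
while tokens[-1] != "<eos>":
    tokens += verify(tokens, propose(tokens, k=4))
tokens = tokens[:tokens.index("<eos>")]  # trim end-of-sequence padding
print(tokens)  # → ['the', 'future', 'of', 'AI', 'is', 'bright']
```

In real EAGLE decoding the draft predicts at the feature level using the target model's hidden states, and verification uses rejection sampling rather than the greedy check above; the control flow is the same, which is why a mismatch (after ``AI`` here) costs only the rejected suffix of the proposal.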
A few important things to consider when using EAGLE-based draft models:

1. The EAGLE-based draft models currently need to be run without tensor parallelism, although
   it is possible to run the main model using tensor parallelism (see the example above). Since the
   speculative models are relatively small, we still see significant speedups. However, this
   limitation will be fixed in a future release.

2. The EAGLE draft models available in this Hugging Face repository cannot be used directly
   with vLLM due to differences in the expected layer names and model definition. To use these
   models with vLLM, convert them first with the provided script. Note that this script does not
   modify the model's weights.
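Conceptually, such a conversion is just a renaming of checkpoint keys. Below is a minimal sketch of the idea, assuming a hypothetical ``model.``-prefix convention (the actual mapping vLLM expects may differ; use the referenced script for real checkpoints):

```python
def remap_eagle_keys(state_dict):
    """Rename draft-checkpoint keys into the layout a loader might expect.
    The 'model.' prefix rule below is a hypothetical example, not vLLM's
    actual mapping; the weights themselves are copied unchanged."""
    remapped = {}
    for name, weight in state_dict.items():
        if name.startswith("lm_head."):
            remapped[name] = weight  # head keys kept as-is
        else:
            remapped["model." + name] = weight
    return remapped

# Toy "checkpoint": weights are plain lists so the sketch stays dependency-free.
ckpt = {"layers.0.self_attn.q_proj.weight": [0.1], "lm_head.weight": [0.2]}
print(sorted(remap_eagle_keys(ckpt)))
# → ['lm_head.weight', 'model.layers.0.self_attn.q_proj.weight']
```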

3. When using EAGLE-based draft models with vLLM, the observed speedup is lower than expected
   for EAGLE-based speculative decoding. This issue is under investigation and is tracked in
   `vllm-project/vllm#9565 <https://github.com/vllm-project/vllm/issues/9565>`_.
   Known differences between the vLLM implementation of EAGLE-based speculation and the original
   EAGLE implementation include:

a. ......
b. .....

A variety of EAGLE draft models are available on HF hub:


Lossless guarantees of Speculative Decoding
-------------------------------------------
In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of