doc(readme): update to mention we support newer Llama versions with TGI

huggingface · Dec 13, 2024 · 8219b67
1 parent 49c9f29
commit 8219b67

Showing 2 changed files with 2 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -23,7 +23,7 @@ working closely with Google and Google Cloud to make this a reality.
 
 We currently support a few LLM models targeting text generation scenarios:
 - 💎 Gemma (2b, 7b)
-- 🦙 Llama2 (7b) and Llama3 (8b)
+- 🦙 Llama2 (7b) and Llama3 (8b). On Text Generation Inference with Jetstream Pytorch, also Llama3.1, Llama3.2 and Llama3.3 (text-only models) are supported, up to 70B parameters.
 - 💨 Mistral (7b)
 
 

diff --git a/docs/source/howto/serving.mdx b/docs/source/howto/serving.mdx
@@ -56,6 +56,6 @@ curl localhost/generate_stream \
 If for some reason you want to use the Pytorch/XLA backend instead, you can set the `JETSTREAM_PT_DISABLE=1` environment variable.
 
 
-When using Jetstream Pytorch engine, it is possible to enable quantization to reduce the memory footprint and increase the throughput. To enable quantization, set the `QUANTIZATION=1` environment variable.
+When using Jetstream Pytorch engine, it is possible to enable quantization to reduce the memory footprint and increase the throughput. To enable quantization, set the `QUANTIZATION=1` environment variable. For instance, on a 2x4 TPU v5e, you can serve models up to 70B parameters such as Llama 3.3-70B.
 
 ***Note: Quantization is still experimental and may produce lower quality results compared to the non-quantized version.***
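
Reviewer note: the 70B claim in the serving doc can be sanity-checked with rough memory arithmetic. The sketch below is not part of the commit. It assumes a 2x4 TPU v5e slice has 8 chips with 16 GB of HBM each, that unquantized weights are stored in 2 bytes per parameter (bf16), and that `QUANTIZATION=1` stores weights in roughly 1 byte per parameter (int8). These figures and the helper names are assumptions for illustration, and the estimate ignores the KV cache and activations.

```python
# Back-of-the-envelope check: do the weights of a model fit in TPU HBM?
# All constants below are assumptions, not values taken from the commit.

HBM_PER_CHIP_GB = 16   # assumed HBM per TPU v5e chip
CHIPS_IN_2X4 = 2 * 4   # a 2x4 topology has 8 chips


def weights_gb(params_billion: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def fits(params_billion: float, bytes_per_param: int) -> bool:
    """True if the weights alone fit in the slice's total HBM."""
    total_hbm_gb = HBM_PER_CHIP_GB * CHIPS_IN_2X4
    return weights_gb(params_billion, bytes_per_param) <= total_hbm_gb


if __name__ == "__main__":
    for label, nbytes in (("bf16", 2), ("int8 (assumed quantized)", 1)):
        print(f"70B {label}: {weights_gb(70, nbytes):.0f} GB, "
              f"fits in 2x4 v5e: {fits(70, nbytes)}")
```

Under these assumptions the unquantized 70B weights alone exceed the slice's total HBM, while the quantized weights leave headroom, which is consistent with the doc pairing the 70B example with quantization.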