diff --git a/blog/2024-07-25-sglang-llama3.md b/blog/2024-07-25-sglang-llama3.md
index 672f3b94..c5866c18 100644
--- a/blog/2024-07-25-sglang-llama3.md
+++ b/blog/2024-07-25-sglang-llama3.md
@@ -15,8 +15,6 @@ SGLang is an open-source project licensed under the Apache 2.0 license. It has b
 
 ## Benchmark Setup
 
-> **Update (2024-07-25 7PM):** We've identified issues in our synthetic dataset generation pipeline, resulting in mostly short prompts. While this means the datasets don't match our earlier descriptions, the comparison remains fair since all engines are benchmarked under the same conditions. We've updated the benchmark setup description to reflect the characteristics of the generated synthetic datasets. We're working on obtaining more benchmark results for longer prompts, but we expect the speedup of SGLang to be less since it primarily accelerates the decoding phase.
-
 We benchmark both offline and online use cases:
 
 - **Offline:** We send 2K to 3K requests at once, measuring output throughput (tokens/second), defined as the number of output tokens divided by the total duration. We test synthetic datasets derived from the ShareGPT dataset. For example, I-512-O-1024 indicates a dataset with an average input of 512 tokens and an average output of 1024 tokens. The five tested datasets are: Dataset 1: I-243-O-770, Dataset 2: I-295-O-770, Dataset 3: I-243-O-386, Dataset 4: I-295-O-386, Dataset 5: I-221-O-201.
@@ -27,6 +25,8 @@ We use OpenAI-compatible APIs to benchmark SGLang and vLLM, and the Triton inter
 
 More details and reproducible scripts are provided in Appendix A. For each model, we will first present the offline results and then present the online results.
 
+Update (2024-07-25 8 PM PST): The dataset descriptions above are accurate but differ from the initial version of this blog post. We identified some issues in our synthetic data generation pipeline, which we have now corrected. These issues caused our benchmark to cover the normal ShareGPT dataset distribution but miss long prompt cases. We are working on obtaining more benchmark results for longer prompts. However, we expect the speedup of SGLang to be less significant for long prompts since it primarily accelerates the decoding phase.
+
 ## Llama-8B on 1 x A100 (bf16)
 
 Starting with the small model Llama-8B, the figure below shows the maximum output throughput each engine can achieve in offline settings across five different datasets. Both TensorRT-LLM and SGLang can achieve a throughput of approximately 4000 tokens per second, while vLLM falls behind.