deploy: 40cbd99

lm-sys · Jul 26, 2024 · b9bdf9d
1 parent e709089
commit b9bdf9d
Showing 64 changed files with 64 additions and 62 deletions.
diff --git a/404/index.html b/404/index.html
diff --git a/_next/data/v3uQtzNY8VzRuInQz0q3R/about.json → _next/data/5c9Zaa5RTVu_kWjqzP3Hs/about.json b/_next/data/v3uQtzNY8VzRuInQz0q3R/about.json → _next/data/5c9Zaa5RTVu_kWjqzP3Hs/about.json
diff --git a/_next/data/5c9Zaa5RTVu_kWjqzP3Hs/blog.json b/_next/data/5c9Zaa5RTVu_kWjqzP3Hs/blog.json
diff --git a/...8VzRuInQz0q3R/blog/2023-03-30-vicuna.json → ...TVu_kWjqzP3Hs/blog/2023-03-30-vicuna.json b/...8VzRuInQz0q3R/blog/2023-03-30-vicuna.json → ...TVu_kWjqzP3Hs/blog/2023-03-30-vicuna.json
diff --git a/...Y8VzRuInQz0q3R/blog/2023-05-03-arena.json → ...RTVu_kWjqzP3Hs/blog/2023-05-03-arena.json b/...Y8VzRuInQz0q3R/blog/2023-05-03-arena.json → ...RTVu_kWjqzP3Hs/blog/2023-05-03-arena.json
diff --git a/...InQz0q3R/blog/2023-05-10-leaderboard.json → ...WjqzP3Hs/blog/2023-05-10-leaderboard.json b/...InQz0q3R/blog/2023-05-10-leaderboard.json → ...WjqzP3Hs/blog/2023-05-10-leaderboard.json
diff --git a/...InQz0q3R/blog/2023-05-25-leaderboard.json → ...WjqzP3Hs/blog/2023-05-25-leaderboard.json b/...InQz0q3R/blog/2023-05-25-leaderboard.json → ...WjqzP3Hs/blog/2023-05-25-leaderboard.json
diff --git a/...uInQz0q3R/blog/2023-06-09-api-server.json → ...kWjqzP3Hs/blog/2023-06-09-api-server.json b/...uInQz0q3R/blog/2023-06-09-api-server.json → ...kWjqzP3Hs/blog/2023-06-09-api-server.json
diff --git a/...InQz0q3R/blog/2023-06-22-leaderboard.json → ...WjqzP3Hs/blog/2023-06-22-leaderboard.json b/...InQz0q3R/blog/2023-06-22-leaderboard.json → ...WjqzP3Hs/blog/2023-06-22-leaderboard.json
diff --git a/...zRuInQz0q3R/blog/2023-06-29-longchat.json → ...u_kWjqzP3Hs/blog/2023-06-29-longchat.json b/...zRuInQz0q3R/blog/2023-06-29-longchat.json → ...u_kWjqzP3Hs/blog/2023-06-29-longchat.json
diff --git a/...VzRuInQz0q3R/blog/2023-07-20-dataset.json → ...Vu_kWjqzP3Hs/blog/2023-07-20-dataset.json b/...VzRuInQz0q3R/blog/2023-07-20-dataset.json → ...Vu_kWjqzP3Hs/blog/2023-07-20-dataset.json
diff --git a/...RuInQz0q3R/blog/2023-10-30-toxicchat.json → ..._kWjqzP3Hs/blog/2023-10-30-toxicchat.json b/...RuInQz0q3R/blog/2023-10-30-toxicchat.json → ..._kWjqzP3Hs/blog/2023-10-30-toxicchat.json
diff --git a/...R/blog/2023-11-14-llm-decontaminator.json → ...s/blog/2023-11-14-llm-decontaminator.json b/...R/blog/2023-11-14-llm-decontaminator.json → ...s/blog/2023-11-14-llm-decontaminator.json
diff --git a/...Y8VzRuInQz0q3R/blog/2023-11-15-slora.json → ...RTVu_kWjqzP3Hs/blog/2023-11-15-slora.json b/...Y8VzRuInQz0q3R/blog/2023-11-15-slora.json → ...RTVu_kWjqzP3Hs/blog/2023-11-15-slora.json
diff --git a/...R/blog/2023-11-21-lookahead-decoding.json → ...s/blog/2023-11-21-lookahead-decoding.json b/...R/blog/2023-11-21-lookahead-decoding.json → ...s/blog/2023-11-21-lookahead-decoding.json
diff --git a/...InQz0q3R/blog/2023-12-07-leaderboard.json → ...WjqzP3Hs/blog/2023-12-07-leaderboard.json b/...InQz0q3R/blog/2023-12-07-leaderboard.json → ...WjqzP3Hs/blog/2023-12-07-leaderboard.json
diff --git a/...8VzRuInQz0q3R/blog/2024-01-17-sglang.json → ...TVu_kWjqzP3Hs/blog/2024-01-17-sglang.json b/...8VzRuInQz0q3R/blog/2024-01-17-sglang.json → ...TVu_kWjqzP3Hs/blog/2024-01-17-sglang.json
diff --git a/...z0q3R/blog/2024-02-05-compressed-fsm.json → ...zP3Hs/blog/2024-02-05-compressed-fsm.json b/...z0q3R/blog/2024-02-05-compressed-fsm.json → ...zP3Hs/blog/2024-02-05-compressed-fsm.json
diff --git a/...8VzRuInQz0q3R/blog/2024-03-01-policy.json → ...TVu_kWjqzP3Hs/blog/2024-03-01-policy.json b/...8VzRuInQz0q3R/blog/2024-03-01-policy.json → ...TVu_kWjqzP3Hs/blog/2024-03-01-policy.json
diff --git a/...uInQz0q3R/blog/2024-04-19-arena-hard.json → ...kWjqzP3Hs/blog/2024-04-19-arena-hard.json b/...uInQz0q3R/blog/2024-04-19-arena-hard.json → ...kWjqzP3Hs/blog/2024-04-19-arena-hard.json
diff --git a/...R/blog/2024-05-02-kaggle-competition.json → ...s/blog/2024-05-02-kaggle-competition.json b/...R/blog/2024-05-02-kaggle-competition.json → ...s/blog/2024-05-02-kaggle-competition.json
diff --git a/...8VzRuInQz0q3R/blog/2024-05-08-llama3.json → ...TVu_kWjqzP3Hs/blog/2024-05-08-llama3.json b/...8VzRuInQz0q3R/blog/2024-05-08-llama3.json → ...TVu_kWjqzP3Hs/blog/2024-05-08-llama3.json
diff --git a/...Qz0q3R/blog/2024-05-17-category-hard.json → ...qzP3Hs/blog/2024-05-17-category-hard.json b/...Qz0q3R/blog/2024-05-17-category-hard.json → ...qzP3Hs/blog/2024-05-17-category-hard.json
diff --git a/...uInQz0q3R/blog/2024-06-27-multimodal.json → ...kWjqzP3Hs/blog/2024-06-27-multimodal.json b/...uInQz0q3R/blog/2024-06-27-multimodal.json → ...kWjqzP3Hs/blog/2024-06-27-multimodal.json
diff --git a/...zRuInQz0q3R/blog/2024-07-01-routellm.json → ...u_kWjqzP3Hs/blog/2024-07-01-routellm.json b/...zRuInQz0q3R/blog/2024-07-01-routellm.json → ...u_kWjqzP3Hs/blog/2024-07-01-routellm.json
diff --git a/_next/data/5c9Zaa5RTVu_kWjqzP3Hs/blog/2024-07-25-sglang-llama3.json b/_next/data/5c9Zaa5RTVu_kWjqzP3Hs/blog/2024-07-25-sglang-llama3.json
@@ -0,0 +1 @@
+{"pageProps":{"frontmatter":{"title":"Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM)","author":"The SGLang Team","date":"Jul 25, 2024","previewImg":"/images/blog/sglang_llama3/preview.png"},"content":"\nAt LMSYS.org, we've been running the [Chatbot Arena](https://chat.lmsys.org/) platform for over a year, serving millions of users. We know firsthand how crucial efficient serving is for AI products and research. Through our operational experiences and in-depth research, we've continuously enhanced the underlying serving systems, spanning from the high-level multi-model serving framework, [FastChat](https://github.com/lm-sys/FastChat/tree/main), to the efficient serving engine, [SGLang Runtime (SRT)](https://github.com/sgl-project/sglang/tree/main).\n\nThis post focuses on SGLang Runtime, a general-purpose serving engine for LLMs and VLMs. While existing options like TensorRT-LLM, vLLM, MLC-LLM, and Hugging Face TGI have their merits, we found them sometimes hard to use, difficult to customize, or lacking in performance. This motivated us to develop SGLang v0.2, aiming to create a serving engine that is not only user-friendly and easily modifiable but also delivers top-tier performance. While SGLang includes frontend language features, this post will focus solely on the backend runtime and use \"SGLang\" and \"SGLang Runtime\" interchangeably to refer to the runtime.\n\nCompared to TensorRT-LLM and vLLM, SGLang Runtime consistently delivers superior or competitive performance in both online and offline scenarios, handling models from Llama-8B to Llama-405B, and on A100 and H100 GPUs, using FP8 and FP16. **SGLang consistently outperforms vLLM, achieving up to 3.8x higher throughput on Llama-70B. 
It also often matches or exceeds TensorRT-LLM, with up to 2.1x higher throughput on Llama-405B.** More importantly, SGLang is fully open-source, written in pure Python, with the core schedulers implemented in fewer than 4K lines of code.\n\nSGLang is an open-source project licensed under the Apache 2.0 license. It has been used by LMSYS Chatbot Arena (to serve some of its models), Databricks, several startups, and research institutes, generating trillions of tokens and enabling faster iteration. As it gradually matures from a research prototype, we invite the community to join us in creating the next-generation efficient engine.\n\n## Benchmark Setup\n\nWe benchmark both offline and online use cases:\n\n- **Offline:** We send 2K to 3K requests at once, measuring output throughput (tokens/second), defined as the number of output tokens divided by the total duration. We test synthetic datasets derived from the ShareGPT dataset. For example, I-512-O-1024 indicates a dataset with an average input of 512 tokens and an average output of 1024 tokens. The five tested datasets are: Dataset 1: I-243-O-770, Dataset 2: I-295-O-770, Dataset 3: I-243-O-386, Dataset 4: I-295-O-386, Dataset 5: I-221-O-201.\n- **Online:** We send requests at rates ranging from 1 to 16 requests per second (RPS), measuring the median end-to-end latency. We use the synthetic dataset I-292-O-579.\n\nWe use vLLM 0.5.2 with default arguments and TensorRT-LLM with the recommended arguments and tuned batch sizes. The prefix cache is turned off for all engines. The purpose is to benchmark the base performance without any additional features, such as speculative decoding or caching.\nWe use OpenAI-compatible APIs to benchmark SGLang and vLLM, and the Triton interface for TensorRT-LLM.\n\nMore details and reproduction scripts are provided in Appendix A. 
For each model, we will first present the offline results and then present the online results.\n\n> Update (2024-07-25 8 PM PST): The dataset descriptions above are accurate but differ from the initial version of this blog post. We identified some issues in our synthetic data generation pipeline, so we corrected the dataset description to reflect the actual tested datasets. The comparison is still fair because all engines are benchmarked under the same conditions. The issues caused our benchmark to cover only the normal ShareGPT dataset distribution but miss long prompt cases. We are working on obtaining more benchmark results for longer prompts. However, we expect the speedup of SGLang to be less significant for long prompts since it primarily accelerates the decoding phase.\n\n## Llama-8B on 1 x A100 (bf16)\n\nStarting with the small model Llama-8B, the figure below shows the maximum output throughput each engine can achieve in offline settings across five different datasets. Both TensorRT-LLM and SGLang can achieve a throughput of approximately 4000 tokens per second, while vLLM falls behind.\n\n<img src=\"/images/blog/sglang_llama3/8b_throughput.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\nThe online benchmark figure below shows a trend similar to the offline case. TensorRT-LLM and SGLang perform equally well and can sustain an RPS \\> 10, while the latency of vLLM increases significantly at a high request rate.  \n\n<img src=\"/images/blog/sglang_llama3/8b_latency.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\n## Llama-70B on 8 x A100 (bf16)\n\nMoving to the larger Llama-70B models with tensor parallelism on 8 GPUs, the trend is similar to the case with 8B. In the offline benchmark below, both TensorRT-LLM and SGLang can scale to a high throughput.   
\n\n<img src=\"/images/blog/sglang_llama3/70b_bf16_throughput.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\nIn the online figure below, TensorRT-LLM shows excellent latency performance thanks to its highly efficient kernel implementations and runtime.   \n\n<img src=\"/images/blog/sglang_llama3/70b_bf16_latency.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\n\n## Llama-70B on 8 x H100 (fp8)\n\nNow, let us test the FP8 performance. Both vLLM and SGLang use FP8 kernels from CUTLASS. In the offline setting, SGLang’s batch scheduler is very efficient and can continue to scale the throughput with larger batch sizes, achieving the highest throughput in this case. Other systems cannot scale their throughput or batch sizes because of OOM errors, the lack of extensive manual tuning, or other overheads. This trend continues in the online case as well, with both SGLang and TensorRT-LLM achieving similar median latency.  \n\n<img src=\"/images/blog/sglang_llama3/70b_fp8_throughput.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\n<br>\n\n<img src=\"/images/blog/sglang_llama3/70b_fp8_latency.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\n## Llama-405B on 8 x H100 (fp8)\n\nFinally, we benchmark the performance on the largest 405B model. Because the model is large, most of the time is spent on the GPU kernels, so the gap between different frameworks shrinks. The poor performance of TensorRT-LLM is probably because the 405B model had just been released, and the version in the provided image that we used has not yet integrated some of the latest optimizations. 
In both online and offline cases, SGLang performs the best.\n\n<img src=\"/images/blog/sglang_llama3/405b_fp8_throughput.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\n<br>\n\n<img src=\"/images/blog/sglang_llama3/405b_fp8_latency.svg\" style=\"display: flex; margin-top: auto; margin-left: auto; margin-right: auto; margin-bottom: auto; width: 70%;\"></img>\n\n## SGLang Overview\n\nSGLang is a serving framework for large language models and vision-language models. It builds on and enhances many good designs from several open-source LLM serving engines, including [LightLLM](https://github.com/ModelTC/lightllm), [vLLM](https://blog.vllm.ai/2023/06/20/vllm.html), and [Guidance](https://github.com/guidance-ai/guidance). It leverages high-performance attention CUDA kernels from [FlashInfer](https://flashinfer.ai/2024/02/02/introduce-flashinfer.html) and integrates torch.compile inspired by [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2/).\n\nAdditionally, we introduced innovations such as [RadixAttention](https://arxiv.org/abs/2312.07104) for automatic KV cache reuse and [compressed state machine](https://lmsys.org/blog/2024-02-05-compressed-fsm/) for fast constrained decoding. SGLang is known for its highly efficient [batch scheduler](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/managers), which is implemented entirely in Python.\nTo make an apples-to-apples comparison, this blog tests the base performance of these serving engines with scenario- or workload-specific optimizations (like prefix caching and speculative decoding) turned off. The speedup in SGLang is achieved through proper engineering.\nSGLang's efficient Python-based batch scheduler scales well, often matching or even outperforming closed-source implementations built with C++.\n\nTable 1 compares various aspects of SGLang, TensorRT-LLM, and vLLM. 
In terms of performance, both SGLang and TensorRT-LLM excel, while vLLM suffers from high CPU scheduling overhead. Regarding usability and customizability, SGLang's lightweight and modular core makes it easy to customize, whereas TensorRT-LLM's complex C++ tech stack and setup instructions make it harder to use and modify. SGLang's source code is fully open-source, while TensorRT-LLM is only partially open-source.\n\nTable 1. Comparison\n\n|  | SGLang | TensorRT-LLM | vLLM |\n| :---- | :---- | :---- | :---- |\n| Performance | Excellent | Excellent | Fair |\n| Usability | Good | Poor | Good |\n| Customizability | High | Low | Medium |\n| Source Code Availability | Fully Open | Partially Open | Fully Open |\n| Programming Language | Python | C++ | Python |\n\n## What is Next\n\nWe're excited to share our latest benchmark results. While there's still more to do, this shows our philosophy of developing a simple, customizable, and high-performance serving engine is achievable. Stay tuned for new features like long context and MoE optimizations, and detailed technical walkthroughs. Join us in building the next-generation serving engine at [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang).\n\n## Try Llama Serving\n\nYou can serve a Llama model easily with the following steps.\n\n1. [Install](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#install) SGLang with pip, from source, or using Docker.\n2. Launch a server:\n    ```\n    # Llama 8B\n    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct\n\n    # Llama 405B\n    python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8\n    ```\n3. 
Send a request with the OpenAI-compatible API:\n    ```\n    curl http://localhost:30000/v1/completions \\\n      -H \"Content-Type: application/json\" \\\n      -d '{\n        \"model\": \"default\",\n        \"prompt\": \"Say this is a test\",\n        \"max_tokens\": 7,\n        \"temperature\": 0\n      }'\n    ```\n4. Run the benchmark:\n    ```\n    python3 -m sglang.bench_serving --backend sglang --num-prompts 1000\n    ```\n\n## The Team\n\nThis blog post was contributed by Liangsheng Yin, Yineng Zhang, Ying Sheng, and over 65 open-source [contributors](https://github.com/sgl-project/sglang/graphs/contributors). We thank Databricks for its support; Ying Sheng’s work was done at Databricks. We especially thank Lianmin Zheng, Zihao Ye, and Horace He for their technical support, Matei Zaharia for his helpful advice, and Cody Yu for his feedback.\n\n## Appendix A: Detailed Benchmark Setups\n\nThe instructions to reproduce the benchmark are at [sglang/benchmark/blog\\_v0\\_2](https://github.com/sgl-project/sglang/tree/main/benchmark/blog\\_v0\\_2).\n\nFor all benchmarks, we set \\`ignore\\_eos\\` or \\`min\\_length/end\\_id\\` to ensure each engine outputs the same number of tokens. We tried using vLLM 0.5.3.post1, but it often crashed under high load and, in our partial benchmarking, seemed to perform similarly to or worse than vLLM 0.5.2. Therefore, we report results from vLLM 0.5.2 instead. 
While we are aware that different server configurations can significantly impact serving performance, we mostly use the default arguments in each engine to mimic the case of a normal user.\n\nFor the 8B and 70B models, we use the [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) bf16 checkpoints, and the [neuralmagic/Meta-Llama-3-70B-Instruct-FP8](https://huggingface.co/neuralmagic/Meta-Llama-3-70B-Instruct-FP8) fp8 checkpoint. For the 405B models, we use dummy weights for all benchmarks. Since the latest TensorRT-LLM image, r24.06, does not support the fbgemm\\_fp8 quantization used in the official [meta-llama/Meta-Llama-3.1-405B-FP8](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-FP8) checkpoint, we use per-layer fp8 quantization in all frameworks and quantize all layers except lm\\_head. We believe this provides a fair comparison among all engines. The A100 and H100 GPUs are 80GB SXM versions.\n","slug":"2024-07-25-sglang-llama3"},"__N_SSG":true}
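For readers who want to check the two metrics defined in the post's Benchmark Setup section, here is a minimal Python sketch of how they could be computed from raw measurements. It is an illustration only: the function names are hypothetical and are not part of SGLang or its benchmark scripts, and the sketch assumes that per-request output token counts and end-to-end latencies have already been collected.

```python
import re
from statistics import median


def parse_dataset_label(label):
    """Split a label such as 'I-512-O-1024' into (avg input tokens, avg output tokens)."""
    match = re.fullmatch(r"I-(\d+)-O-(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized dataset label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def output_throughput(output_token_counts, total_duration_s):
    """Offline metric: total number of output tokens divided by the total duration."""
    if total_duration_s <= 0:
        raise ValueError("total duration must be positive")
    return sum(output_token_counts) / total_duration_s


def median_e2e_latency(latencies_s):
    """Online metric: median end-to-end latency across all requests, in seconds."""
    return median(latencies_s)
```

For example, three requests that produce 770, 770, and 386 output tokens over a total of 2 seconds give `output_throughput([770, 770, 386], 2.0) == 963.0` tokens per second.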
diff --git a/...data/v3uQtzNY8VzRuInQz0q3R/donations.json → ...data/5c9Zaa5RTVu_kWjqzP3Hs/donations.json b/...data/v3uQtzNY8VzRuInQz0q3R/donations.json → ...data/5c9Zaa5RTVu_kWjqzP3Hs/donations.json
diff --git a/...ta/v3uQtzNY8VzRuInQz0q3R/vicuna_eval.json → ...ta/5c9Zaa5RTVu_kWjqzP3Hs/vicuna_eval.json b/...ta/v3uQtzNY8VzRuInQz0q3R/vicuna_eval.json → ...ta/5c9Zaa5RTVu_kWjqzP3Hs/vicuna_eval.json
diff --git a/_next/data/v3uQtzNY8VzRuInQz0q3R/blog.json b/_next/data/v3uQtzNY8VzRuInQz0q3R/blog.json