From c409c66c9526ca08a9f322b7ab2ed6383e08d536 Mon Sep 17 00:00:00 2001
From: Ying Sheng
Date: Thu, 25 Jul 2024 12:53:24 -0700
Subject: [PATCH] Update SGLang July release blogpost (#113)

---
 blog/2024-07-25-sglang-llama3.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/blog/2024-07-25-sglang-llama3.md b/blog/2024-07-25-sglang-llama3.md
index 9244de9e..7d1b3608 100644
--- a/blog/2024-07-25-sglang-llama3.md
+++ b/blog/2024-07-25-sglang-llama3.md
@@ -71,6 +71,8 @@ At last, we benchmark the performance on the largest 405B model. Because the mod
 
 SGLang is a serving framework for large language models and vision-language models. It builds on and enhances many good designs from several open-source LLM serving engines, including [LightLLM](https://github.com/ModelTC/lightllm), [vLLM](https://blog.vllm.ai/2023/06/20/vllm.html), and [Guidance](https://github.com/guidance-ai/guidance). It leverages high-performance attention CUDA kernels from [FlashInfer](https://flashinfer.ai/2024/02/02/introduce-flashinfer.html) and integrates torch.compile, inspired by [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2/). Additionally, we introduced innovations such as [RadixAttention](https://arxiv.org/abs/2312.07104) for automatic KV cache reuse and a [compressed state machine](https://lmsys.org/blog/2024-02-05-compressed-fsm/) for fast constrained decoding. SGLang is known for its highly efficient [batch scheduler](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/managers), which is implemented entirely in Python.
 
+To make an apples-to-apples comparison, this blog tests the base performance of these serving engines, with scenario- or workload-specific optimizations (such as prefix caching and speculative decoding) turned off. The speedups reported for SGLang are achieved through careful engineering rather than workload-specific tricks.
+SGLang's efficient Python-based batch scheduler scales well, often matching or even outperforming C++-based closed-source implementations.
 
 Table 1 compares various aspects of SGLang, TensorRT-LLM, and vLLM. In terms of performance, both SGLang and TensorRT-LLM excel. Regarding usability and customizability, SGLang's lightweight and modular core makes it easy to customize, whereas TensorRT-LLM's complex C++ tech stack and setup instructions make it harder to use and modify. SGLang's source code is fully open-source, while TensorRT-LLM is only partially open-source. vLLM, meanwhile, suffers from high CPU scheduling overhead.
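
A minimal sketch of the base-performance setup the added lines describe: disabling RadixAttention's automatic KV cache reuse (prefix caching) when launching the SGLang server via its `--disable-radix-cache` flag. The model path and port shown here are illustrative assumptions, not the blog's exact benchmark configuration; the other engines would be configured analogously.

```bash
# Hypothetical launch command for a base-performance benchmark:
# RadixAttention's automatic KV cache reuse (prefix caching) is turned off
# so that no workload-specific optimization inflates the measured throughput.
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --port 30000 \
  --disable-radix-cache
```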