Update SGLang July release blogpost (#113)
Ying1123 authored Jul 25, 2024
1 parent fe7ff4a commit c409c66
Showing 1 changed file with 2 additions and 0 deletions.
blog/2024-07-25-sglang-llama3.md: 2 additions & 0 deletions
@@ -71,6 +71,8 @@ At last, we benchmark the performance on the largest 405B model. Because the mod
SGLang is a serving framework for large language models and vision-language models. It builds on and enhances many good designs from several open-source LLM serving engines, including [LightLLM](https://github.com/ModelTC/lightllm), [vLLM](https://blog.vllm.ai/2023/06/20/vllm.html), and [Guidance](https://github.com/guidance-ai/guidance). It leverages high-performance attention CUDA kernels from [FlashInfer](https://flashinfer.ai/2024/02/02/introduce-flashinfer.html) and integrates torch.compile inspired by [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2/).

Additionally, we introduced innovations such as [RadixAttention](https://arxiv.org/abs/2312.07104) for automatic KV cache reuse and [compressed state machine](https://lmsys.org/blog/2024-02-05-compressed-fsm/) for fast constrained decoding. SGLang is known for its highly efficient [batch scheduler](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/managers), which is implemented entirely in Python.
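To make the KV-cache reuse concrete, here is a minimal, illustrative Python sketch of the radix-tree prefix matching that underlies RadixAttention. This is not SGLang's actual implementation; the `Node` and `RadixCache` classes and the placeholder `kv_slots` lists are invented for illustration:

```python
# Illustrative sketch of radix-tree KV-cache reuse (not SGLang's real code).
from dataclasses import dataclass, field

@dataclass
class Node:
    children: dict = field(default_factory=dict)  # first edge token -> child Node
    tokens: list = field(default_factory=list)    # token ids along the edge into this node
    kv_slots: list = field(default_factory=list)  # KV-cache slots for those tokens (placeholder)

class RadixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return KV slots for the longest cached prefix; only the rest needs prefill."""
        node, i, reused = self.root, 0, []
        while i < len(tokens) and tokens[i] in node.children:
            child = node.children[tokens[i]]
            k = 0
            while k < len(child.tokens) and i + k < len(tokens) and child.tokens[k] == tokens[i + k]:
                k += 1
            reused.extend(child.kv_slots[:k])
            if k < len(child.tokens):  # diverged mid-edge
                break
            node, i = child, i + k
        return reused

    def insert(self, tokens, kv_slots):
        """Record a request's tokens and KV slots so later requests can reuse them."""
        node, i = self.root, 0
        while i < len(tokens):
            if tokens[i] not in node.children:
                node.children[tokens[i]] = Node(tokens=list(tokens[i:]),
                                                kv_slots=list(kv_slots[i:]))
                return
            child = node.children[tokens[i]]
            k = 0
            while k < len(child.tokens) and i + k < len(tokens) and child.tokens[k] == tokens[i + k]:
                k += 1
            if k < len(child.tokens):  # split the edge at the divergence point
                tail = Node(children=child.children,
                            tokens=child.tokens[k:], kv_slots=child.kv_slots[k:])
                child.children = {tail.tokens[0]: tail}
                child.tokens, child.kv_slots = child.tokens[:k], child.kv_slots[:k]
            node, i = child, i + k
```

A scheduler built on this idea calls `match_prefix` when a request arrives and `insert` when it finishes, so shared prompt prefixes are looked up in the tree rather than recomputed.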
To make an apples-to-apples comparison, this blog tests the base performance of these serving engines, with scenario- or workload-specific optimizations (such as prefix caching and speculative decoding) turned off. The speedup in SGLang is achieved through careful engineering of the core engine rather than workload-specific tricks.
SGLang's efficient Python-based batch scheduler scales well, often matching or even outperforming closed-source implementations built with C++.
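As a rough illustration of how such base throughput can be probed, below is a minimal Python client that fires concurrent requests at an OpenAI-compatible completions endpoint and reports output tokens per second. It is not the blog's benchmark harness; the URL, the port (SGLang's default, 30000), the placeholder model name, and the prompt set are assumptions to adapt to your own setup:

```python
# Minimal throughput probe against an OpenAI-compatible server (assumed setup,
# not the blog's harness). Start a server first, e.g. SGLang on port 30000.
import time
import concurrent.futures
import requests

URL = "http://127.0.0.1:30000/v1/completions"  # assumed local endpoint
PROMPTS = [f"Question {i}: explain KV caching in one paragraph." for i in range(64)]

def one_request(prompt):
    r = requests.post(URL, json={
        "model": "default",   # placeholder; use the served model name if required
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.0,
    }, timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    tokens = sum(pool.map(one_request, PROMPTS))
elapsed = time.time() - start
print(f"{tokens} output tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```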

Table 1 compares various aspects of SGLang, TensorRT-LLM, and vLLM. In terms of performance, both SGLang and TensorRT-LLM excel, while vLLM suffers from high CPU scheduling overhead. Regarding usability and customizability, SGLang's lightweight and modular core makes it easy to customize, whereas TensorRT-LLM's complex C++ tech stack and setup instructions make it harder to use and modify. SGLang's source code is also fully open-source, while TensorRT-LLM is only partially open-source.
