diff --git a/blog/2024-07-25-sglang-llama3.md b/blog/2024-07-25-sglang-llama3.md
index 7d1b3608..672f3b94 100644
--- a/blog/2024-07-25-sglang-llama3.md
+++ b/blog/2024-07-25-sglang-llama3.md
@@ -15,10 +15,12 @@ SGLang is an open-source project licensed under the Apache 2.0 license. It has b
 
 ## Benchmark Setup
 
-We benchmark both offline and online use cases.
+> **Update (2024-07-25 7PM):** We've identified issues in our synthetic dataset generation pipeline, resulting in mostly short prompts. While this means the datasets don't match our earlier descriptions, the comparison remains fair since all engines are benchmarked under the same conditions. We've updated the benchmark setup description to reflect the characteristics of the generated synthetic datasets. We're working on obtaining more benchmark results for longer prompts, but we expect SGLang's speedup to be smaller there, since it primarily accelerates the decoding phase.
 
-- For the offline case, we send 2K to 3K requests at once, measuring output throughput (tokens/second), which is defined as the number of output tokens divided by the total duration. We test using the ShareGPT dataset and several synthetic datasets. We use In\[2048, 4096\]-Out\[256, 512\] to indicate a synthetic dataset with input lengths sampled from a uniform distribution \[2048, 4096\] and output lengths from \[256, 512\].
-- For the online case, we send requests at a rate ranging from 1 to 16 requests per second (RPS), measuring the median end-to-end latency. We use a synthetic dataset In\[512, 4096\]-Out\[128, 1024\].
+We benchmark both offline and online use cases:
+
+- **Offline:** We send 2K to 3K requests at once, measuring output throughput (tokens/second), defined as the number of output tokens divided by the total duration. We test synthetic datasets derived from the ShareGPT dataset. For example, I-512-O-1024 indicates a dataset with an average input of 512 tokens and an average output of 1024 tokens. The five tested datasets are: Dataset 1: I-243-O-770, Dataset 2: I-295-O-770, Dataset 3: I-243-O-386, Dataset 4: I-295-O-386, Dataset 5: I-221-O-201.
+- **Online:** We send requests at rates ranging from 1 to 16 requests per second (RPS), measuring the median end-to-end latency. We use the synthetic dataset I-292-O-579.
 
 We use vLLM 0.5.2 with default arguments and TensorRT-LLM with the recommended arguments and tuned batch sizes. The prefix cache is turned off for all engines. The purpose is to benchmark the base performance without any additional features, such as speculative decoding or caching. We use OpenAI-compatible APIs to benchmark SGLang and vLLM, and the Triton interface for TensorRT-LLM.
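The offline metric described in the hunk above is simple to reproduce against any OpenAI-compatible endpoint, which is how SGLang and vLLM are driven here. Below is a minimal sketch of such a harness, not the post's actual benchmark script: the endpoint URL, port, model name, concurrency level, and dataset format are all assumptions standing in for whatever the deployment under test uses.

```python
# Hedged sketch: offline throughput = total output tokens / total wall-clock time.
# Assumptions: an OpenAI-compatible server at localhost:30000 (a placeholder;
# adjust for your engine), and a dataset of (prompt, max_tokens) pairs.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def run_one(prompt: str, max_tokens: int) -> int:
    """Send one request; return the number of generated tokens."""
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.usage.completion_tokens

def offline_throughput(dataset: list[tuple[str, int]]) -> float:
    """Fire every request at once; return output tokens per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=256) as pool:  # arbitrary concurrency
        counts = list(pool.map(lambda pair: run_one(*pair), dataset))
    return sum(counts) / (time.perf_counter() - start)
```

Submitting the full batch of 2K to 3K requests up front keeps the engine's continuous-batching scheduler saturated, which is what the offline scenario is meant to measure.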
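The online case changes only the arrival pattern: requests trickle in at a target rate, and the reported statistic is the median end-to-end latency rather than aggregate throughput. A sketch under the same assumptions as above (production harnesses often draw inter-arrival times from a Poisson process; this one uses a fixed interval for simplicity):

```python
# Hedged sketch: issue requests at a fixed rate (RPS) and report the median
# end-to-end latency. Endpoint and model name are placeholders, as above.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

async def timed_request(prompt: str, max_tokens: int) -> float:
    """Return the end-to-end latency of one request, in seconds."""
    start = time.perf_counter()
    await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return time.perf_counter() - start

async def median_e2e_latency(dataset: list[tuple[str, int]], rps: float) -> float:
    """Launch requests at `rps` requests/second without awaiting completion."""
    tasks = []
    for prompt, max_tokens in dataset:
        tasks.append(asyncio.create_task(timed_request(prompt, max_tokens)))
        await asyncio.sleep(1.0 / rps)  # fixed inter-arrival gap
    latencies = await asyncio.gather(*tasks)
    return statistics.median(latencies)

# Example: asyncio.run(median_e2e_latency(dataset, rps=8)) measures one point
# of the 1 to 16 RPS sweep.
```

Because each request is launched without waiting for the previous reply, a slower engine accumulates in-flight requests as the rate rises and its median latency grows, which is exactly what sweeping 1 to 16 RPS is designed to expose.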
diff --git a/public/images/blog/sglang_llama3/405b_fp8_throughput.svg b/public/images/blog/sglang_llama3/405b_fp8_throughput.svg
index eddac73d..ea72b579 100644
--- a/public/images/blog/sglang_llama3/405b_fp8_throughput.svg
+++ b/public/images/blog/sglang_llama3/405b_fp8_throughput.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file
diff --git a/public/images/blog/sglang_llama3/70b_bf16_throughput.svg b/public/images/blog/sglang_llama3/70b_bf16_throughput.svg
index 9262fd9a..abe57790 100644
--- a/public/images/blog/sglang_llama3/70b_bf16_throughput.svg
+++ b/public/images/blog/sglang_llama3/70b_bf16_throughput.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file
diff --git a/public/images/blog/sglang_llama3/70b_fp8_throughput.svg b/public/images/blog/sglang_llama3/70b_fp8_throughput.svg
index bbeeef0f..20b1480d 100644
--- a/public/images/blog/sglang_llama3/70b_fp8_throughput.svg
+++ b/public/images/blog/sglang_llama3/70b_fp8_throughput.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file
diff --git a/public/images/blog/sglang_llama3/8b_throughput.svg b/public/images/blog/sglang_llama3/8b_throughput.svg
index 3a34576d..f49b1932 100644
--- a/public/images/blog/sglang_llama3/8b_throughput.svg
+++ b/public/images/blog/sglang_llama3/8b_throughput.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file