Commit: Correct benchmark setups

Ying1123 committed Jul 26, 2024
1 parent c409c66 commit 734d913
Showing 5 changed files with 9 additions and 7 deletions.
8 changes: 5 additions & 3 deletions blog/2024-07-25-sglang-llama3.md
@@ -15,10 +15,12 @@ SGLang is an open-source project licensed under the Apache 2.0 license. It has b

## Benchmark Setup

-We benchmark both offline and online use cases.
+> **Update (2024-07-25 7PM):** We've identified issues in our synthetic dataset generation pipeline, resulting in mostly short prompts. While this means the datasets don't match our earlier descriptions, the comparison remains fair since all engines are benchmarked under the same conditions. We've updated the benchmark setup description to reflect the characteristics of the generated synthetic datasets. We're working on obtaining more benchmark results for longer prompts, but we expect the speedup of SGLang to be smaller since it primarily accelerates the decoding phase.
-- For the offline case, we send 2K to 3K requests at once, measuring output throughput (tokens/second), which is defined as the number of output tokens divided by the total duration. We test using the ShareGPT dataset and several synthetic datasets. We use In\[2048, 4096\]-Out\[256, 512\] to indicate a synthetic dataset with input lengths sampled from a uniform distribution \[2048, 4096\] and output lengths from \[256, 512\].
-- For the online case, we send requests at a rate ranging from 1 to 16 requests per second (RPS), measuring the median end-to-end latency. We use a synthetic dataset In\[512, 4096\]-Out\[128, 1024\].
+We benchmark both offline and online use cases:
+
+- **Offline:** We send 2K to 3K requests at once, measuring output throughput (tokens/second), defined as the number of output tokens divided by the total duration. We test synthetic datasets derived from the ShareGPT dataset. For example, I-512-O-1024 indicates a dataset with an average input of 512 tokens and an average output of 1024 tokens. The five tested datasets are: Dataset 1: I-243-O-770, Dataset 2: I-295-O-770, Dataset 3: I-243-O-386, Dataset 4: I-295-O-386, Dataset 5: I-221-O-201.
+- **Online:** We send requests at rates ranging from 1 to 16 requests per second (RPS), measuring the median end-to-end latency. We use the synthetic dataset I-292-O-579.

We use vLLM 0.5.2 with default arguments and TensorRT-LLM with the recommended arguments and tuned batch sizes. The prefix cache is turned off for all engines. The purpose is to benchmark the base performance without any additional features, such as speculative decoding or caching.
We use OpenAI-compatible APIs to benchmark SGLang and vLLM, and the Triton interface for TensorRT-LLM.
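
The throughput and latency metrics defined in the setup above reduce to simple computations over per-request timing records. A minimal Python sketch, assuming hypothetical `(start, end, num_output_tokens)` records collected by a benchmark client (the numbers are illustrative, not results from this commit):

```python
import statistics

# Hypothetical per-request records: (start_time_s, end_time_s, num_output_tokens).
# In a real run these would come from the benchmark client; values are made up.
records = [
    (0.00, 4.21, 512),
    (0.05, 6.90, 497),
    (0.10, 5.33, 505),
]

# Offline metric: output throughput = total output tokens / total duration,
# measured from the first request start to the last request finish.
total_output_tokens = sum(tokens for _, _, tokens in records)
duration = max(end for _, end, _ in records) - min(start for start, _, _ in records)
print(f"output throughput: {total_output_tokens / duration:.1f} tokens/s")

# Online metric: median end-to-end latency across all requests at a given RPS.
latencies = [end - start for start, end, _ in records]
print(f"median E2E latency: {statistics.median(latencies):.2f} s")
```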
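Since SGLang and vLLM are benchmarked through OpenAI-compatible APIs, each benchmark request is a plain HTTP call. A hedged sketch of a single request, assuming an engine has already been launched locally (the URL, port, and model name are placeholders, not values from this commit):

```python
import requests

# Assumption: an OpenAI-compatible server is running at this address.
resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": "Summarize the following text: ...",
        "max_tokens": 256,
        "temperature": 0.0,
    },
    timeout=600,
)
resp.raise_for_status()
data = resp.json()
print(data["choices"][0]["text"])
# The standard OpenAI response also reports token counts, which feed the
# throughput calculation above.
print("output tokens:", data["usage"]["completion_tokens"])
```

TensorRT-LLM, by contrast, is driven through its Triton interface rather than this HTTP API.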
2 changes: 1 addition & 1 deletion public/images/blog/sglang_llama3/405b_fp8_throughput.svg
2 changes: 1 addition & 1 deletion public/images/blog/sglang_llama3/70b_bf16_throughput.svg
2 changes: 1 addition & 1 deletion public/images/blog/sglang_llama3/70b_fp8_throughput.svg
2 changes: 1 addition & 1 deletion public/images/blog/sglang_llama3/8b_throughput.svg
