
[Benchmark] Benchmark structured output with datasets #10557

Merged
16 commits merged into vllm-project:main on Dec 4, 2024

Conversation

@xuechendi (Contributor) commented on Nov 22, 2024

Add a structured output benchmark.

Base PR: #10046
Additional work:

  1. Add four guided options: 'grammar', 'choice', 'regex', and 'json'. For JSON, add a single-schema mode and the 'xgrammar_bench' dataset (multi-schema).
  2. Add a guided-decoding-ratio option (default 1.0); a ratio < 1.0 produces a mixed set of requests, part regular and part guided (see the sketch after this list).
  3. Add a correctness-rate check.
  4. Report first-token and next-token latency when testing with the AsyncEngine.
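
As an illustration of item 2, here is a minimal sketch (not the PR's actual code) of how a guided-decoding ratio could be used to split a request list into guided and regular subsets; the helper name and the seed parameter are hypothetical:

    import random

    def split_requests(requests, guided_ratio=1.0, seed=0):
        # Hypothetical helper: with guided_ratio=0.5, half of the requests
        # carry a guided-decoding constraint (e.g. a JSON schema) and the
        # other half run as regular, unconstrained decoding.
        rng = random.Random(seed)
        shuffled = list(requests)
        rng.shuffle(shuffled)
        n_guided = int(len(shuffled) * guided_ratio)
        return shuffled[:n_guided], shuffled[n_guided:]

    guided, regular = split_requests(range(10), guided_ratio=0.5)
    print(len(guided), len(regular))  # 5 5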

How to test:

  • with guided decoding
python benchmarks/benchmark_guided.py --model meta-llama/Llama-3.2-3B-Instruct --dataset xgrammar_bench --async-engine --output-len 512 --num-prompts 10 --enable-chunked-prefill --guided-decoding-ratio 1.0 --save-results

  • with no guided decoding
python benchmarks/benchmark_guided.py --model meta-llama/Llama-3.2-3B-Instruct --dataset xgrammar_bench --output-len 512 --num-prompts 10 --no-guided-decoding --save-results

Expected output
FileName: 1.0guided_Llama-3.2-3B-Instruct_xgrammar_bench_10_out512_asyncTrue_warmupTrue_chunkedprefillTrue.txt

    "elapsed_time": 50.00640656100586,
    "num_requests": 10,
    "total_num_tokens": 8086,
    "total_output_tokens": 5120,
    "requests_per_second": 0.1999743770390778,
    "tokens_per_second": "161.70",
    "output_tokens_per_second": "102.39",
    "correct_rate(%)": 100.0,
    "first_token_latency(msecs)": {
        "count": 10.0,
        "mean": 33074.16370075662,
        "std": 9578.487238858703,
        "min": 19958.579740021378,
        "25%": 25430.36330281757,
        "50%": 33561.18999654427,
        "75%": 40371.71240762109,
        "max": 46179.445307003334
    },
    "next_token_latency(msecs)": {
        "count": 10.0,
        "mean": 7.487070074518177,
        "std": 0.21345235857034978,
        "min": 7.0583598064933035,
        "25%": 7.452180709461126,
        "50%": 7.483904802198513,
        "75%": 7.609307728370618,
        "max": 7.772003888715707
    }
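
The latency blocks above are percentile summaries over the per-request measurements. A minimal sketch of how such a summary can be produced (assuming per-request latencies are collected in milliseconds; this is an illustration, not necessarily the script's exact code):

    import pandas as pd

    # Hypothetical per-request first-token latencies in milliseconds.
    first_token_latency_ms = [19958.6, 25430.4, 33561.2, 40371.7, 46179.4]

    # describe() yields count/mean/std/min/25%/50%/75%/max, the same shape
    # as the "first_token_latency(msecs)" block above.
    summary = pd.Series(first_token_latency_ms).describe()
    print(summary.to_dict())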


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@aarnphm (Contributor) left a comment


tiny comment.

(Review thread on benchmarks/benchmark_guided.py: outdated, resolved)
@xuechendi force-pushed the benchmark_structured_output branch from 6534b9d to d67fc48 on November 25, 2024 at 22:48
@xuechendi (Contributor, Author) commented:

@simon-mo, please help review.

@simon-mo (Collaborator) left a comment


I have a question regarding the measure of correctness for xgrammar dataset. Do we expect the model to be able to fully return 100% the correct output? Should we evaluate based on whether it matches the JSON schema instead of the content?

(Review thread on benchmarks/benchmark_guided.py: resolved)
(Review thread on benchmarks/benchmark_guided.py: outdated, resolved)
@xuechendi (Contributor, Author) commented:

I have a question regarding the measure of correctness for xgrammar dataset. Do we expect the model to be able to fully return 100% the correct output? Should we evaluate based on whether it matches the JSON schema instead of the content?

Hello, @simon-mo

The correctness check is based only on whether the output can be successfully parsed according to the format type:
https://github.com/vllm-project/vllm/pull/10557/files#diff-be4e291d6b3d1360bc13597125d2aec6cb3fa6231834655cfc70dbb0a531234eR279-R289

The reason you didn't see 100% in my earlier example output is that I used '--guided-decoding-ratio 0.5', meaning 50% of requests used guided decoding and 50% used regular decoding, and regular decoding sometimes fails to generate valid JSON.
With '--guided-decoding-ratio' set to 1 (the default), you should expect 100% correctness.
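
For reference, a minimal sketch of a parse-based check along these lines (illustrative only, not the exact code linked above); the jsonschema branch shows the stricter schema-match alternative raised in the review and assumes the jsonschema package is installed:

    import json

    from jsonschema import ValidationError, validate  # optional stricter check

    def is_correct(output: str, guide_type: str, schema: dict | None = None) -> bool:
        # Return True if the output parses for the given guide type.
        if guide_type == "json":
            try:
                parsed = json.loads(output)
            except json.JSONDecodeError:
                return False
            if schema is not None:
                # Stricter variant: also require the output to match the schema.
                try:
                    validate(instance=parsed, schema=schema)
                except ValidationError:
                    return False
            return True
        # Other guide types ('regex', 'choice', 'grammar') would get their
        # own match checks here.
        return True

    outputs = ['{"a": 1}', "not json"]
    correct = sum(is_correct(o, "json") for o in outputs)
    print(f"correct_rate(%): {100.0 * correct / len(outputs)}")  # 50.0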

@xuechendi (Contributor, Author) commented:

Hi @simon-mo, I enabled warmup for the non-xgrammar datasets; here are the results:
Using a single JSON schema + warmup

    "elapsed_time": 54.99144273600541,
    "num_requests": 128,
    "total_num_tokens": 139520,
    "total_output_tokens": 65536,
    "requests_per_second": 2.3276348761112344,
    "tokens_per_second": "2537.12",
    "output_tokens_per_second": "1191.75",
    "correct_rate(%)": 100.0,
    "first_token_latency(msecs)": {
        "count": 128.0,
        "mean": 14692.10625355845,
        "std": 7359.727861436729,
        "min": 1336.0014711506665,
        "25%": 8178.687368519604,
        "50%": 14545.75154162012,
        "75%": 21511.82376150973,
        "max": 27600.430445978418
    },
    "next_token_latency(msecs)": {
        "count": 128.0,
        "mean": 70.86720946193621,
        "std": 8.439623933853836,
        "min": 53.35845652065416,
        "25%": 63.384519656892735,
        "50%": 73.45476706154841,
        "75%": 77.98998092096099,
        "max": 81.07992057239244
    }

Using a single JSON schema, skipping warmup

    "elapsed_time": 88.94759224611335,
    "num_requests": 128,
    "total_num_tokens": 139520,
    "total_output_tokens": 65536,
    "requests_per_second": 1.4390496332473024,
    "tokens_per_second": "1568.56",
    "output_tokens_per_second": "736.79",
    "correct_rate(%)": 100.0,
    "first_token_latency(msecs)": {
        "count": 128.0,
        "mean": 48026.763826343085,
        "std": 7584.742728700046,
        "min": 36633.825331926346,
        "25%": 42391.85892988462,
        "50%": 47750.630808994174,
        "75%": 54306.403648806736,
        "max": 61219.33139488101
    },
    "next_token_latency(msecs)": {
        "count": 128.0,
        "mean": 73.12619282088733,
        "std": 9.379682187792639,
        "min": 54.036737062476334,
        "25%": 65.84560697577632,
        "50%": 75.79372507234439,
        "75%": 80.14213092189948,
        "max": 84.3030315213582
    }
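
A minimal sketch for comparing two saved result files like the two above (this assumes --save-results writes a JSON object with these keys; the file names here are hypothetical):

    import json

    def load_results(path: str) -> dict:
        with open(path) as f:
            return json.load(f)

    warm = load_results("warmup_results.txt")       # hypothetical file names
    cold = load_results("no_warmup_results.txt")

    speedup = cold["elapsed_time"] / warm["elapsed_time"]
    ttft_drop = (cold["first_token_latency(msecs)"]["mean"]
                 - warm["first_token_latency(msecs)"]["mean"])
    print(f"warmup speedup: {speedup:.2f}x, mean TTFT reduced by {ttft_drop:.0f} ms")
    # With the numbers above: roughly 1.62x faster end-to-end and about
    # 33 s lower mean first-token latency.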

@mgoin (Member) left a comment


I think this is in a good place to land as a base for development, especially considering we have been using it in all our xgrammar PRs :)

It would be great to have a serving benchmark soon after this, so we can sweep QPS rates rather than relying on offline batching.

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 3, 2024
@mgoin merged commit 381ac93 into vllm-project:main on Dec 4, 2024
45 checks passed
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
…0557)

Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
4 participants