
[Benchmark] Benchmark structured output with datasets #10557

Merged
16 commits merged into vllm-project:main on Dec 4, 2024

Conversation

@xuechendi (Contributor) commented on Nov 22, 2024

Add a structured output benchmark.

Base PR: #10046
Additional work:

  1. Add four guided options: 'grammar', 'choice', 'regex', and 'json'. For JSON, add a single-schema mode and the 'xgrammar_bench' dataset (multi-schema).
  2. Add a guided-decoding-ratio option (default 1.0); a ratio < 1.0 produces a mixed set of requests, part regular and part guided (see the sketch after this list).
  3. Add a correctness-rate check.
  4. Report first-token and next-token latency when testing with the AsyncEngine.
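
As an illustration of item 2, here is a minimal sketch (not the PR's actual code) of how a guided-decoding ratio could be used to split a request list into guided and regular subsets; the helper name and the seed parameter are hypothetical:

    import random

    def split_requests(requests, guided_ratio=1.0, seed=0):
        # Hypothetical helper: with guided_ratio=0.5, half of the requests
        # carry a guided-decoding constraint (e.g. a JSON schema) and the
        # other half run as regular, unconstrained decoding.
        rng = random.Random(seed)
        shuffled = list(requests)
        rng.shuffle(shuffled)
        n_guided = int(len(shuffled) * guided_ratio)
        return shuffled[:n_guided], shuffled[n_guided:]

    guided, regular = split_requests(range(10), guided_ratio=0.5)
    print(len(guided), len(regular))  # 5 5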

How to test:

  • with guided decoding
python benchmarks/benchmark_guided.py --model meta-llama/Llama-3.2-3B-Instruct --dataset xgrammar_bench --async-engine --output-len 512 --num-prompts 10 --enable-chunked-prefill --guided-decoding-ratio 1.0 --save-results

  • with no guided decoding
python benchmarks/benchmark_guided.py --model meta-llama/Llama-3.2-3B-Instruct --dataset xgrammar_bench --output-len 512 --num-prompts 10 --no-guided-decoding --save-results

Expected output
FileName: 1.0guided_Llama-3.2-3B-Instruct_xgrammar_bench_10_out512_asyncTrue_warmupTrue_chunkedprefillTrue.txt

    "elapsed_time": 50.00640656100586,
    "num_requests": 10,
    "total_num_tokens": 8086,
    "total_output_tokens": 5120,
    "requests_per_second": 0.1999743770390778,
    "tokens_per_second": "161.70",
    "output_tokens_per_second": "102.39",
    "correct_rate(%)": 100.0,
    "first_token_latency(msecs)": {
        "count": 10.0,
        "mean": 33074.16370075662,
        "std": 9578.487238858703,
        "min": 19958.579740021378,
        "25%": 25430.36330281757,
        "50%": 33561.18999654427,
        "75%": 40371.71240762109,
        "max": 46179.445307003334
    },
    "next_token_latency(msecs)": {
        "count": 10.0,
        "mean": 7.487070074518177,
        "std": 0.21345235857034978,
        "min": 7.0583598064933035,
        "25%": 7.452180709461126,
        "50%": 7.483904802198513,
        "75%": 7.609307728370618,
        "max": 7.772003888715707
    }
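
The latency blocks above are percentile summaries over the per-request measurements. A minimal sketch of how such a summary can be produced (assuming per-request latencies are collected in milliseconds; this is an illustration, not necessarily the script's exact code):

    import pandas as pd

    # Hypothetical per-request first-token latencies in milliseconds.
    first_token_latency_ms = [19958.6, 25430.4, 33561.2, 40371.7, 46179.4]

    # describe() yields count/mean/std/min/25%/50%/75%/max, the same shape
    # as the "first_token_latency(msecs)" block above.
    summary = pd.Series(first_token_latency_ms).describe()
    print(summary.to_dict())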


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@aarnphm (Contributor) left a comment


tiny comment.

(Review thread on benchmarks/benchmark_guided.py: outdated, resolved)
@xuechendi force-pushed the benchmark_structured_output branch from 6534b9d to d67fc48 on November 25, 2024 at 22:48
@xuechendi (Contributor, Author) commented:

@simon-mo, please help review.

@simon-mo (Collaborator) left a comment


I have a question regarding the measure of correctness for xgrammar dataset. Do we expect the model to be able to fully return 100% the correct output? Should we evaluate based on whether it matches the JSON schema instead of the content?

(Review thread on benchmarks/benchmark_guided.py: resolved)
(Review thread on benchmarks/benchmark_guided.py: outdated, resolved)
@xuechendi (Contributor, Author) commented:

I have a question regarding the measure of correctness for xgrammar dataset. Do we expect the model to be able to fully return 100% the correct output? Should we evaluate based on whether it matches the JSON schema instead of the content?

Hello, @simon-mo

The correctness check is based only on whether the output can be successfully parsed according to the format type:
https://github.com/vllm-project/vllm/pull/10557/files#diff-be4e291d6b3d1360bc13597125d2aec6cb3fa6231834655cfc70dbb0a531234eR279-R289

The reason you didn't see 100% in my earlier example output is that I used '--guided-decoding-ratio 0.5', meaning 50% of requests used guided decoding and 50% used regular decoding, and regular decoding sometimes fails to generate valid JSON.
With '--guided-decoding-ratio' set to 1 (the default), you should expect 100% correctness.
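
For reference, a minimal sketch of a parse-based check along these lines (illustrative only, not the exact code linked above); the jsonschema branch shows the stricter schema-match alternative raised in the review and assumes the jsonschema package is installed:

    import json

    from jsonschema import ValidationError, validate  # optional stricter check

    def is_correct(output: str, guide_type: str, schema: dict | None = None) -> bool:
        # Return True if the output parses for the given guide type.
        if guide_type == "json":
            try:
                parsed = json.loads(output)
            except json.JSONDecodeError:
                return False
            if schema is not None:
                # Stricter variant: also require the output to match the schema.
                try:
                    validate(instance=parsed, schema=schema)
                except ValidationError:
                    return False
            return True
        # Other guide types ('regex', 'choice', 'grammar') would get their
        # own match checks here.
        return True

    outputs = ['{"a": 1}', "not json"]
    correct = sum(is_correct(o, "json") for o in outputs)
    print(f"correct_rate(%): {100.0 * correct / len(outputs)}")  # 50.0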

@xuechendi (Contributor, Author) commented:

Hi @simon-mo, I enabled warmup for the non-xgrammar datasets; here are the results:
Using a single JSON schema + warmup

    "elapsed_time": 54.99144273600541,
    "num_requests": 128,
    "total_num_tokens": 139520,
    "total_output_tokens": 65536,
    "requests_per_second": 2.3276348761112344,
    "tokens_per_second": "2537.12",
    "output_tokens_per_second": "1191.75",
    "correct_rate(%)": 100.0,
    "first_token_latency(msecs)": {
        "count": 128.0,
        "mean": 14692.10625355845,
        "std": 7359.727861436729,
        "min": 1336.0014711506665,
        "25%": 8178.687368519604,
        "50%": 14545.75154162012,
        "75%": 21511.82376150973,
        "max": 27600.430445978418
    },
    "next_token_latency(msecs)": {
        "count": 128.0,
        "mean": 70.86720946193621,
        "std": 8.439623933853836,
        "min": 53.35845652065416,
        "25%": 63.384519656892735,
        "50%": 73.45476706154841,
        "75%": 77.98998092096099,
        "max": 81.07992057239244
    }

Using a single JSON schema, skipping warmup

    "elapsed_time": 88.94759224611335,
    "num_requests": 128,
    "total_num_tokens": 139520,
    "total_output_tokens": 65536,
    "requests_per_second": 1.4390496332473024,
    "tokens_per_second": "1568.56",
    "output_tokens_per_second": "736.79",
    "correct_rate(%)": 100.0,
    "first_token_latency(msecs)": {
        "count": 128.0,
        "mean": 48026.763826343085,
        "std": 7584.742728700046,
        "min": 36633.825331926346,
        "25%": 42391.85892988462,
        "50%": 47750.630808994174,
        "75%": 54306.403648806736,
        "max": 61219.33139488101
    },
    "next_token_latency(msecs)": {
        "count": 128.0,
        "mean": 73.12619282088733,
        "std": 9.379682187792639,
        "min": 54.036737062476334,
        "25%": 65.84560697577632,
        "50%": 75.79372507234439,
        "75%": 80.14213092189948,
        "max": 84.3030315213582
    }
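
A minimal sketch for comparing two saved result files like the two above (this assumes --save-results writes a JSON object with these keys; the file names here are hypothetical):

    import json

    def load_results(path: str) -> dict:
        with open(path) as f:
            return json.load(f)

    warm = load_results("warmup_results.txt")       # hypothetical file names
    cold = load_results("no_warmup_results.txt")

    speedup = cold["elapsed_time"] / warm["elapsed_time"]
    ttft_drop = (cold["first_token_latency(msecs)"]["mean"]
                 - warm["first_token_latency(msecs)"]["mean"])
    print(f"warmup speedup: {speedup:.2f}x, mean TTFT reduced by {ttft_drop:.0f} ms")
    # With the numbers above: roughly 1.62x faster end-to-end and about
    # 33 s lower mean first-token latency.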

@mgoin (Member) left a comment


I think this is in a good place to land as a base for development, especially considering we have been using it in all our xgrammar PRs :)

It would be great to have a serving benchmark soon after this, so we can sweep QPS rates rather than relying on offline batching.

@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 3, 2024
@mgoin merged commit 381ac93 into vllm-project:main on Dec 4, 2024
45 checks passed
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
…0557)

Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)
4 participants