[Benchmark] Benchmark structured output with datasets #10557
Conversation
Signed-off-by: Aaron Pham <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
tiny comment.
Signed-off-by: Chendi Xue <[email protected]>
Force-pushed from 6534b9d to d67fc48
@simon-mo, please take a look and review.
Signed-off-by: Chendi Xue <[email protected]>
I have a question regarding how correctness is measured for the xgrammar dataset. Do we expect the model to return a fully correct output 100% of the time? Should we instead evaluate whether the output matches the JSON schema, rather than checking the content?
Hello, @simon-mo. The correctness check is based only on whether the output can be successfully parsed according to the format type. The reason you didn't see 100% in my example output is that I used `--guided-decoding-ratio 0.5`, meaning 50% of the requests use guided decoding and 50% use regular decoding, and regular decoding sometimes fails to generate valid JSON.
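To illustrate the parse-based correctness check described above, here is a minimal sketch of the idea for JSON-format outputs. The helper name and structure are hypothetical, not the actual benchmark code; a schema check (e.g. with the `jsonschema` package) could be layered on top if content-level validation were wanted:

```python
import json

def output_is_correct(output: str) -> bool:
    """Correctness = the output parses as valid JSON.

    This checks format only, not content. Guided-decoding
    requests should essentially always pass; regular decoding
    may occasionally fail to emit well-formed JSON.
    """
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# Toy mix of one valid and one invalid output.
outputs = ['{"name": "vLLM"}', "not json at all"]
accuracy = sum(output_is_correct(o) for o in outputs) / len(outputs)
print(accuracy)
```

With a 50/50 mix of guided and regular decoding, an accuracy below 100% is expected, since only the regular-decoding half can produce unparseable output.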
Hi, @simon-mo, I enabled warmup for the non-xgrammar datasets; here are the results:
Using a single JSON schema + skipping warmup:
Signed-off-by: Chendi Xue <[email protected]>
I think this is in a good place to land as a base for development, especially considering we have been using it in all our xgrammar PRs :)
It would be great to have a serving benchmark soon after this, so we can sweep QPS rates rather than only doing offline batching.
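For the serving-benchmark follow-up, sweeping QPS typically means generating request arrival times from a Poisson process at each target rate. A rough sketch of that idea (the function name and structure are my own, not code from this PR):

```python
import random

def poisson_arrival_times(qps: float, num_requests: int,
                          seed: int = 0) -> list[float]:
    """Generate request arrival timestamps for a target QPS.

    Inter-arrival gaps are drawn from an exponential distribution
    with mean 1/qps, i.e. a Poisson arrival process.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        t += rng.expovariate(qps)
        times.append(t)
    return times

# Sweep a few QPS rates; the mean inter-arrival gap should be ~1/qps.
for qps in (1.0, 5.0, 10.0):
    times = poisson_arrival_times(qps, num_requests=1000)
    mean_gap = times[-1] / len(times)
    print(f"qps={qps}: mean gap ~ {mean_gap:.3f}s")
```

A serving benchmark would dispatch each request at its generated timestamp instead of submitting the whole batch at once, which is what makes the QPS sweep meaningful.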
…0557) Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: Chendi Xue <[email protected]> Co-authored-by: Aaron Pham <[email protected]>
Add structured output benchmark.
Base PR: #10046
Additional work:
How to test:
Expected output
FileName: 1.0guided_Llama-3.2-3B-Instruct_xgrammar_bench_10_out512_asyncTrue_warmupTrue_chunkedprefillTrue.txt
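The output filename above appears to encode the run configuration (guided ratio, model, dataset, request count, output length, and async/warmup/chunked-prefill flags). A hedged sketch of pulling those fields back out with a regex; the pattern and field names are my guess at the naming scheme, not code from this PR:

```python
import re

# Hypothetical pattern for the benchmark's result-file naming scheme.
FILENAME_RE = re.compile(
    r"(?P<guided_ratio>[\d.]+)guided_"
    r"(?P<model>[^_]+)_"
    r"(?P<dataset>.+?)_"
    r"(?P<num_prompts>\d+)_"
    r"out(?P<output_len>\d+)_"
    r"async(?P<use_async>True|False)_"
    r"warmup(?P<warmup>True|False)_"
    r"chunkedprefill(?P<chunked_prefill>True|False)\.txt"
)

name = ("1.0guided_Llama-3.2-3B-Instruct_xgrammar_bench_10"
        "_out512_asyncTrue_warmupTrue_chunkedprefillTrue.txt")
m = FILENAME_RE.fullmatch(name)
assert m is not None
config = m.groupdict()
print(config["model"], config["dataset"], config["output_len"])
```

Encoding the configuration in the filename makes it easy to compare result files from different sweeps without opening them.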