[Core][Performance] Add XGrammar support for guided decoding and set it as default #10785
Conversation
Signed-off-by: Aaron Pham <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Cleaned-up version of this PR: vllm-project#10785. https://arxiv.org/pdf/2411.15100
Thanks for your contribution to integrating XGrammar into vLLM! Overall it looks good, but there are some minor points to enhance parallelism.
```diff
-        guided_params: GuidedDecodingParams,
-        tokenizer) -> Optional[LogitsProcessor]:
+        guided_params: GuidedDecodingParams, tokenizer: PreTrainedTokenizer,
+        model_config: ModelConfig) -> LogitsProcessor | None:
     # CFG grammar not supported by LMFE, so we use outlines instead
     if guided_params.backend == 'outlines' or guided_params.grammar:
```
XGrammar can also do grammar-guided decoding and accelerate it. The grammar formats for XGrammar and Outlines are different: XGrammar uses the GBNF format, while Outlines uses Lark grammar. That difference should be documented.
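For readers unfamiliar with the two dialects, here is a minimal sketch of the same toy grammar written in each format (illustrative only; consult the GBNF and Lark references for exact syntax):

```python
# GBNF (GGML BNF) style, the dialect XGrammar accepts:
gbnf_grammar = r'''
root ::= "yes" | "no"
'''

# Lark (EBNF-like) style, the dialect Outlines accepts:
lark_grammar = r'''
start: "yes" | "no"
'''

# Both describe the same two-word language, but the rule syntax differs
# (`root ::=` vs `start:`), so a grammar written for one backend cannot
# be passed to the other unchanged.
print(gbnf_grammar.strip())
print(lark_grammar.strip())
```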
I see, I will add this difference to the docs.
I think we should just remove the grammar check here. If users send a grammar, they should also specify the backend (it would probably be better to document the Cartesian product of the valid combinations).
Essentially a cleaned-up version of this PR: vllm-project#10785. Especially since `outlines` is rather slow, and the new version is tough to integrate as they do not focus on being pickleable, which is a key feature for us since we use the multiprocessing engine: dottxt-ai/outlines-core#99. I assume more and more users will change over to `xgrammar`. This is a minimal implementation. https://arxiv.org/pdf/2411.15100 Signed-off-by: Jannis Schönleber <[email protected]>
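The pickleability constraint mentioned above can be illustrated with a minimal sketch (the class name is hypothetical, not vLLM's actual processor): any logits processor handed to the multiprocessing engine must survive a pickle round-trip, which closures and raw FFI handles cannot do.

```python
import pickle

# Hypothetical stand-in for a guided-decoding logits processor.
class GrammarLogitsProcessor:
    def __init__(self, grammar: str):
        self.grammar = grammar  # keep only picklable state

    def __call__(self, token_ids, logits):
        # A real implementation would mask logits using a compiled grammar;
        # this sketch just passes them through.
        return logits

proc = GrammarLogitsProcessor('root ::= "yes" | "no"')
# Crossing the process boundary requires this round-trip to succeed:
restored = pickle.loads(pickle.dumps(proc))
print(restored.grammar == proc.grammar)  # True
```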
Updated this PR with caches for the tokenizer data and the grammar compiler, to avoid constructing these data structures for each request. It isn't pretty, but it boosts throughput by about 1.4x. I need to do more profiling, but we are limited by the required-serialization architecture we currently have. We plan to move the FSM initialization out of the frontend to both simplify the implementation and speed up TTFT.

Setup: Llama-3.1-8B-Instruct, 1xH100

Command:
Before:
After:
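A minimal sketch of the caching idea described in the comment above, using `functools.lru_cache` (the function and its return value are illustrative stand-ins, not vLLM's or xgrammar's actual API): build the expensive per-tokenizer structures once and reuse them across requests.

```python
from functools import lru_cache

@lru_cache(maxsize=8)
def get_grammar_compiler(tokenizer_name: str, vocab_size: int):
    # Stand-in for the expensive setup (tokenizer data + grammar
    # compiler) that previously ran once per request.
    return {"tokenizer": tokenizer_name, "vocab_size": vocab_size}

a = get_grammar_compiler("llama-3.1-8b-instruct", 128256)
b = get_grammar_compiler("llama-3.1-8b-instruct", 128256)
print(a is b)  # True: the second request reuses the cached object
```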
@Ubospica do you know when XGrammar can support regex? This would help with covering existing use cases.
@mgoin I added a pull request yesterday that adds some simple regex pattern + integer range support:
```python
if isinstance(params, Sequence) else copy.copy(params),
```
This is actually a blocking review. We can only introduce it if it is not a perf regression.
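The concern here is copying params per request. A hedged sketch of the pattern (class and helper names are hypothetical, not vLLM's API): shallow-copy the params and attach the processor to a fresh list, so a stateful guided-decoding processor added for one request never mutates the caller's shared object.

```python
import copy

class SamplingParams:
    # Minimal stand-in carrying only the field relevant to the sketch.
    def __init__(self):
        self.logits_processors = []

def with_guided_processor(params, processor):
    per_request = copy.copy(params)  # cheap shallow copy
    # Replace, don't append: a shallow copy still shares the original
    # list, so build a new one to avoid cross-request leakage.
    per_request.logits_processors = list(params.logits_processors) + [processor]
    return per_request

shared = SamplingParams()
req = with_guided_processor(shared, "grammar-fsm")
print(shared.logits_processors)  # [] -- shared params untouched
print(req.logits_processors)     # ['grammar-fsm']
```

The shallow copy keeps the per-request cost low, which is why the perf concern above is about where the copy happens, not whether it is deep.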
Thanks for the review @simon-mo. I moved the copy into a specific
Force-pushed from 4ee464a to 5f49734.
The new dependency in this PR appears to have broken installation on ARM.
Thanks for reporting @hmellor, indeed it seems there isn't a manylinux ARM wheel available: https://pypi.org/project/xgrammar/#files. I'll work on a patch fix.
Obviously super cool to see new integrations, but it does seem a bit hasty to me to immediately change the default. The implementation with outlines-core should be able to close the gap after all, and this one does not support regex yet. Or is xgrammar just objectively better?
I second this opinion. Currently, the same behaviour cannot be expected from `grammar`. I added a simple PR with some rudimentary regex + integer range support (mlc-ai/xgrammar#106). I can attest that it is much faster, especially if one uses dynamic schemas. However, we should keep it as an option. I introduced it as an option in my closed PR (#10803), but I forgot it when I discussed this with @mgoin.
Hi @stefanobranco and @joennlae, thanks for raising your concerns. Our primary goal is to immediately improve structured output performance where it is easy to do so while maintaining the same behavior. With xgrammar as the default in supported cases, we still fall back to outlines in the several cases covered here: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/guided_decoding/__init__.py#L18-L48. Please let me know if a case isn't being accounted for that is affecting your usage; we do not want to change external behavior. We have several integration tests that I have been using to create these rules, but more test points are certainly welcome!

We have several fast-followup items to reduce the special cases around using xgrammar and to improve performance even further in V0. We are also working with the devs of that project on enabling outlines>=0.1.8 support. Then, of course, we will enable the usage of structured output in V1. I hope this is helpful context; we will work on making a public roadmap for longer-term goals. Please join the #feat-structured-output channel in Slack if you want more direct discussion with the people working on this.
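A simplified sketch of the backend-selection idea described above (illustrative only; vLLM's exact rules live in the linked `guided_decoding/__init__.py`): an explicit user choice always wins, and requests using features xgrammar did not yet support at the time, such as regex, fall back to outlines.

```python
from typing import Optional

def choose_backend(requested: Optional[str], has_regex: bool) -> str:
    # Explicit user choice always wins.
    if requested:
        return requested
    # Regex was not yet supported by xgrammar at the time of this PR,
    # so such requests fall back to outlines.
    if has_regex:
        return "outlines"
    # Default for the supported json/grammar cases.
    return "xgrammar"

print(choose_backend(None, has_regex=False))        # xgrammar
print(choose_backend(None, has_regex=True))         # outlines
print(choose_backend("outlines", has_regex=False))  # outlines
```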
Thanks @stefanobranco, @joennlae, @mgoin for the great feedback. The first release of XGrammar focuses on performance across grammar and JSON schema. We would like to ensure the system is holistically designed for zero-overhead structured output, which aligns with what many users need. Now that the initial release has landed, we are working full steam to enable full support for JSON schema and regex. Please feel free to open new issues on XGrammar to give us feedback. Our general mission is to bring flexible, zero-overhead structured generation everywhere, and we are excited to work with the community here to achieve that mission together. We love contributions and collaborations toward better, zero-overhead structured output for everyone.
[Core][Performance] Add XGrammar support for guided decoding and set it as default (vllm-project#10785) Signed-off-by: Aaron Pham <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]>
Will this support models that use Mistral tokenizers?
Adds initial support for XGrammar for V0 and makes it the default for grammar and JSON usage. Written in collaboration with @mgoin.
I'm using the benchmark scripts from #10557
Results using XGrammar as the backend:

Compared to outlines:

NOTE: running on A100 80GB with Llama 3.2 3B, chunked prefill enabled, and a JSON grammar.