
[Core][Performance] Add XGrammar support for guided decoding and set it as default #10785

Merged · 20 commits · Dec 3, 2024

Conversation

@aarnphm (Contributor) commented Nov 29, 2024

Add initial support for XGrammar for V0 and make it the default backend for grammar and JSON usage. Written in collaboration with @mgoin.

I'm using the benchmark scripts from #10557

Results for using XGrammar as backend:

Throughput: 0.94 requests/s, 1022.46 total tokens/s, 480.27 output tokens/s Correct rate is 100.0 %
First token latency(msecs):
count      10.000000
mean     4552.206317
std       734.671745
min      3289.774953
25%      3864.269087
50%      5102.686635
75%      5102.717258
max      5114.346570
dtype: float64
Next token latency(msecs):
count    10.000000
mean     11.906452
std       1.409063
min      10.831970
25%      10.837367
50%      10.854235
75%      13.227200
max      14.325024
dtype: float64

Comparing to outlines

Throughput: 0.22 requests/s, 241.22 total tokens/s, 113.31 output tokens/s Correct rate is 100.0 %
First token latency(msecs):
count       10.000000
mean     38533.083248
std         35.807892
min      38491.813741
25%      38491.826321
50%      38556.601226
75%      38556.628519
max      38568.547848
dtype: float64
Next token latency(msecs):
count    10.000000
mean     12.955556
std       0.042220
min      12.901755
25%      12.914099
50%      12.953058
75%      12.996646
max      13.003127
dtype: float64

NOTE: Running on A100 80GB, with Llama 3.2 3B, chunked prefill enabled, and a JSON grammar
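
For reference, selecting the backend per request through the OpenAI-compatible server looks roughly like the minimal sketch below (the guided_json and guided_decoding_backend extra-body fields follow vLLM's guided decoding API; the model name and schema are only placeholders):

# Minimal sketch: per-request backend selection via the OpenAI-compatible server.
# Treat as illustrative; the schema and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Describe a person as JSON."}],
    extra_body={
        "guided_json": person_schema,
        "guided_decoding_backend": "xgrammar",  # or "outlines"
    },
)
print(completion.choices[0].message.content)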

Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of that by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify bot added the documentation and ci/build labels Nov 29, 2024
Signed-off-by: Aaron Pham <[email protected]>
@aarnphm marked this pull request as draft November 29, 2024 23:46
@aarnphm marked this pull request as ready for review November 30, 2024 00:16
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
joennlae added a commit to 44ai-labs/vllm that referenced this pull request Nov 30, 2024
@Ubospica left a comment
Thanks for your contribution to integrating XGrammar into vLLM! Overall it looks good, but there are some minor points that could enhance parallelism.

-        guided_params: GuidedDecodingParams,
-        tokenizer) -> Optional[LogitsProcessor]:
+        guided_params: GuidedDecodingParams, tokenizer: PreTrainedTokenizer,
+        model_config: ModelConfig) -> LogitsProcessor | None:
     # CFG grammar not supported by LMFE, so we use outlines instead
     if guided_params.backend == 'outlines' or guided_params.grammar:
@Ubospica commented Nov 30, 2024
XGrammar can also do grammar decoding and accelerate it. The grammar formats for XGrammar and Outlines differ: XGrammar uses the GBNF format, while Outlines uses Lark grammar. That difference might be worth documenting.
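
For illustration, the same toy grammar written in both dialects as Python strings (hand-written examples; consult each project's docs for the exact syntax):

# Hand-written illustration of the two grammar dialects; syntax details may differ.
gbnf_grammar = r'''
root   ::= answer
answer ::= "yes" | "no"
'''  # GBNF, consumed by XGrammar

lark_grammar = r'''
start:  answer
answer: "yes" | "no"
'''  # Lark EBNF, consumed by Outlines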

@aarnphm (Contributor, Author) replied:
I see, I will add this difference to the docs.

@aarnphm (Contributor, Author) replied:
I think we should just remove the grammar check here.

If users send a grammar, they should also specify the backend (it is probably better to document the Cartesian product of the combinations).

joennlae added a commit to 44ai-labs/vllm that referenced this pull request Dec 1, 2024
Essentially a cleaned-up version of this PR:
vllm-project#10785

Especially since `outlines` is rather slow and the new version is tough to integrate, as they do not focus on being pickleable, which is a key feature for us when using the multiprocessing engine: dottxt-ai/outlines-core#99

I assume more and more will change over to `xgrammar`.

This is a minimal implementation.

https://arxiv.org/pdf/2411.15100

Signed-off-by: Jannis Schönleber <[email protected]>
@mgoin (Member) commented Dec 1, 2024

Updated this PR with caches for the tokenizer data and the grammar compiler, to avoid constructing these data structures for each request. It isn't pretty, but it boosts throughput by about 1.4x.

I need to perform more profiling, but we are limited by the required-serialization architecture that we currently have. We plan to move the FSM initialization out of the frontend to both simplify the implementation and speed up TTFT.
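
The caching itself is conceptually simple; a rough sketch of the idea using functools.lru_cache and xgrammar's TokenizerInfo/GrammarCompiler classes (the code that landed differs in detail):

# Rough sketch of the caching idea; the actual PR code differs in detail.
from functools import lru_cache

import xgrammar as xgr


@lru_cache(maxsize=None)
def get_compiler(tokenizer):
    # Building TokenizerInfo and GrammarCompiler is expensive, so do it once
    # per tokenizer instead of once per request.
    tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
    return xgr.GrammarCompiler(tokenizer_info)


def compile_json_schema(tokenizer, schema: str):
    # The compiled grammar could also be cached per (tokenizer, schema) pair.
    return get_compiler(tokenizer).compile_json_schema(schema)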

Setup: Llama-3.1-8B-Instruct, 1xH100

Command:

python benchmark_guided.py --model meta-llama/Llama-3.1-8B-Instruct --dataset xgrammar_bench --async-engine --output-len 512 --num-prompts 20 --enable-chunked-prefill --guided-decoding-ratio 1

Before:

Throughput: 1.46 requests/s, 1189.12 total tokens/s, 748.00 output tokens/s Correct rate is 95.0 % 
First token latency(msecs):
count      20.000000
mean     7180.142369
std      1212.973158
min      4644.173431
25%      7012.610644
50%      7578.541221
75%      8079.524654
max      8092.886029
dtype: float64
Next token latency(msecs):
count    20.000000
mean     12.662371
std       2.336552
min      10.942158
25%      10.942283
50%      11.864077
75%      12.990130
max      17.550802
dtype: float64

After:

Throughput: 2.12 requests/s, 1726.67 total tokens/s, 1086.13 output tokens/s Correct rate is 95.0 % 
First token latency(msecs):
count      20.000000
mean     3254.682581
std       290.516334
min      2869.083916
25%      2869.120228
50%      3449.280638
75%      3477.460549
max      3477.504314
dtype: float64
Next token latency(msecs):
count    20.000000
mean     12.054585
std       0.550868
min      11.643879
25%      11.643967
50%      11.674903
75%      12.786106
max      12.786302
dtype: float64

@mgoin (Member) commented Dec 2, 2024

@Ubospica do you know when XGrammar will be able to support regex? That would help with covering existing use cases.

@mgoin changed the title from "feat(guided): xgrammar support" to "[Core][Performance] Add XGrammar support for guided decoding" Dec 2, 2024
@joennlae (Contributor) commented Dec 2, 2024

@mgoin I added a pull request yesterday that adds simple regex pattern and integer range support:

mlc-ai/xgrammar#106

@mergify bot added the frontend label Dec 2, 2024
@simon-mo changed the title from "[Core][Performance] Add XGrammar support for guided decoding" to "[Core][Performance] Add XGrammar support for guided decoding and set it as default" Dec 3, 2024
@simon-mo previously approved these changes Dec 3, 2024
Review thread on vllm/entrypoints/llm.py (outdated, resolved)
@simon-mo dismissed their stale review December 3, 2024 01:41

The `if isinstance(params, Sequence) else copy.copy(params)` change is actually a blocking review item. We can only introduce it if it is not a perf regression.

@mgoin (Member) commented Dec 3, 2024

Thanks for the review @simon-mo. I moved the copy into a specific `if sampling_params.guided_decoding is not None` case - ready for re-review.
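
For context, the change amounts to guarding the copy so the common path avoids it; roughly like the sketch below (maybe_copy is a hypothetical helper, not the exact diff in llm.py):

# Sketch of the guarded copy; maybe_copy is a hypothetical helper.
import copy


def maybe_copy(sampling_params):
    # Only copy params that carry guided-decoding state, since those are
    # mutated per request; plain params can be shared as-is.
    if sampling_params.guided_decoding is not None:
        return copy.copy(sampling_params)
    return sampling_params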

@DarkLight1337 merged commit 9323a31 into vllm-project:main Dec 3, 2024
73 checks passed
@hmellor (Collaborator) commented Dec 3, 2024

The new dependency in this PR appears to have broken installation on ARM

8.373 ERROR: Could not find a version that satisfies the requirement xgrammar (from versions: none)
8.419 ERROR: No matching distribution found for xgrammar
------
Dockerfile.arm:37
--------------------
  36 |     
  37 | >>> RUN --mount=type=cache,target=/root/.cache/pip \
  38 | >>>     --mount=type=bind,src=requirements-common.txt,target=requirements-common.txt \
  39 | >>>     --mount=type=bind,src=requirements-cpu.txt,target=requirements-cpu.txt \
  40 | >>>     pip install -v -r requirements-cpu.txt
  41 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install -v -r requirements-cpu.txt" did not complete successfully: exit code: 1

@mgoin (Member) commented Dec 3, 2024

Thanks for reporting @hmellor, indeed it seems there isn't a manylinux ARM wheel available: https://pypi.org/project/xgrammar/#files

I'll work on a patch fix.
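
A common way to handle this kind of platform gap, shown below purely as an illustration and not necessarily the patch that landed, is to restrict the requirement with an environment marker so that platforms without wheels skip it:

# requirements-common.txt (illustrative only)
xgrammar; platform_machine == "x86_64"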

@stefanobranco commented:
Obviously it's super cool to see new integrations, but it does seem a bit hasty to me to immediately change the default? The implementation with outlines-core should be able to close the gap after all, and this one does not support regex yet. Or is xgrammar just objectively better?

@joennlae (Contributor) commented Dec 3, 2024

I second this opinion. Currently, the same behaviour cannot be expected from `grammar`. I added a simple PR with some rudimentary regex + integer range support (mlc-ai/xgrammar#106).

I can attest that it is much faster, especially if one uses dynamic schemas. However, we should use outlines as the default, as it supports more cases for now, and that way the change is not breaking for most users.

I introduced it as an option in my closed PR (#10803), but I forgot to bring it up when I discussed it with @mgoin.

@mgoin (Member) commented Dec 3, 2024

Hi @stefanobranco and @joennlae, thanks for raising your concerns. Our primary goal is to immediately improve structured output performance where it is easy to do so while maintaining the same behavior. With xgrammar as the default in supported cases, we still fall back to outlines in the several cases covered here: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/guided_decoding/__init__.py#L18-L48
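
Paraphrased, the dispatch in that file boils down to checks of roughly this shape (a sketch, including a hypothetical _looks_like_gbnf helper, not the exact code):

# Sketch of the backend-selection/fallback logic; not the exact vLLM code.
def _looks_like_gbnf(grammar: str) -> bool:
    # Hypothetical heuristic: GBNF rules use "::=", Lark rules use ":".
    return "::=" in grammar


def choose_backend(guided_params) -> str:
    backend = guided_params.backend or "xgrammar"
    if backend != "xgrammar":
        return backend
    # At the time of this PR, xgrammar did not cover regex or choice
    # constraints and only understood GBNF grammars, so those fall back.
    if guided_params.regex is not None or guided_params.choice is not None:
        return "outlines"
    if guided_params.grammar is not None and not _looks_like_gbnf(guided_params.grammar):
        return "outlines"
    return "xgrammar"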

Please let me know if a case that affects your usage isn't being accounted for. We do not want to change external behavior. We have several integration tests that I have been using to create these rules, but more test points are certainly welcome!

We have several fast-follow items to reduce the special cases around using xgrammar and to improve performance even further in V0. We are also working on enabling outlines>=0.1.8 support together with the devs of that project. Then, of course, we will enable the use of structured output in V1.

I hope this is helpful context; we will work on making a public roadmap for longer-term goals. Please join the #feat-structured-output channel in Slack if you want to have a more direct discussion with the people working on this.

@Ubospica commented Dec 5, 2024

Thanks @stefanobranco, @joennlae, @mgoin for the great feedback.

The initial release of XGrammar focuses on performance across grammar and JSON schema. We would like to ensure the system is holistically designed for zero-overhead structured output, which aligns with the needs we see from many users.

Now that the initial release has landed, we are working full steam to enable full support for JSON schema and regex. Please feel free to open new issues on XGrammar to give us feedback.

Our general mission is to bring flexible, zero-overhead structured generation everywhere, and we are excited to work with the community here to achieve that mission together. We welcome contributions and collaborations to bring better, zero-overhead structured output to everyone.

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
[Core][Performance] Add XGrammar support for guided decoding and set it as default (vllm-project#10785)

Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: mgoin <[email protected]>
@ktrapeznikov commented:
Will this support models that use Mistral tokenizers?

ZenPuzzle pushed a commit to ZenPuzzle/vllm that referenced this pull request Dec 19, 2024
[Core][Performance] Add XGrammar support for guided decoding and set it as default (vllm-project#10785)

Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: mgoin <[email protected]>
Labels: ci/build, documentation, frontend, performance, ready
Projects: None yet
9 participants