[Core] Do async init of xgrammar in the engine #10871

Closed

Conversation

@russellb (Collaborator) commented Dec 3, 2024

This PR does two key things:

  1. Remove logit processor initialization from the client side and move it back
    to its original spot within the engine. This allows us to avoid having to
    serialize the logits processors and send them over zmq to the worker.

  2. Perform the xgrammar initialization asynchronously in a thread. XGrammar
    releases the Python GIL during this process, so the worker will continue in
    parallel. This allows the first forward pass to proceed while initialization
    occurs.

This is a draft until PR #10576 is complete, which updates outlines to a new,
much more performant version. Otherwise, performance with outlines would be
unacceptable with these changes.
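For illustration, here is a minimal sketch of the deferred-compilation pattern described in item 2 above: compilation is submitted to a background thread when the logits processor is created, and the result is awaited only the first time the processor is applied. The `compile_grammar` helper and the processor class are assumptions for this sketch, not vLLM's or xgrammar's actual API.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)


def compile_grammar(schema: str) -> object:
    """Placeholder for the expensive xgrammar compilation step,
    which releases the GIL while it runs."""
    return object()


class AsyncGrammarLogitsProcessor:
    def __init__(self, schema: str):
        # Kick off compilation in a background thread as soon as the
        # processor is created ...
        self._future: Future = _executor.submit(compile_grammar, schema)
        self._compiled = None

    def __call__(self, token_ids, logits):
        # ... but only block on the result the first time the processor
        # is actually applied, so the first forward pass can proceed in
        # parallel with grammar compilation.
        if self._compiled is None:
            self._compiled = self._future.result()
        # In the real implementation the compiled grammar would be used
        # here to mask `logits`; this sketch just passes them through.
        return logits
```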

Here are some test results on a system with NVIDIA L4 GPUs from before and after
these changes.

$ python benchmarks/benchmark_guided.py --model meta-llama/Llama-3.2-1B-Instruct --dataset xgrammar_bench --async-engine --output-len 512 --num-prompts 20 --enable-chunked-prefill --guided-decoding-ratio 1


main @ 2f2cdc745a7a569637c58cfd5f6789c1d0741c84

Throughput: 1.70 requests/s, 1384.55 total tokens/s, 870.93 output tokens/s Correct rate is 95.0 % First token latency: 2069.74 msecs Next token latency: 18.92 msecs

Throughput: 1.70 requests/s, 1383.59 total tokens/s, 870.32 output tokens/s Correct rate is 95.0 % First token latency: 2048.76 msecs Next token latency: 18.98 msecs

Throughput: 1.70 requests/s, 1384.31 total tokens/s, 870.78 output tokens/s Correct rate is 95.0 % First token latency: 2037.78 msecs Next token latency: 18.99 msecs


xgrammar-init-in-engine-rebase branch

Throughput: 1.88 requests/s, 1528.67 total tokens/s, 961.58 output tokens/s Correct rate is 95.0 % First token latency: 1001.18 msecs Next token latency: 18.85 msecs

Throughput: 1.87 requests/s, 1523.59 total tokens/s, 958.39 output tokens/s Correct rate is 95.0 % First token latency: 1065.43 msecs Next token latency: 18.79 msecs

Throughput: 1.89 requests/s, 1536.15 total tokens/s, 966.29 output tokens/s Correct rate is 95.0 % First token latency: 1013.37 msecs Next token latency: 18.72 msecs

0d07f68 MQ engine: remove guided decoding init from the client
aba9688 xgrammar: run grammar compilation async

commit 0d07f68
Author: Mark McLoughlin [email protected]
Date: Mon Nov 25 13:59:04 2024 -0500

MQ engine: remove guided decoding init from the client

Currently with MQLLMEngine, we are initializing LogitsProcessors
on the client side, pickling the entire list of LogitsProcessors,
and sending them over ZeroMQ to the engine.

This was put in place so that the expensive initialization (tens
of seconds) of the Outlines LogitsProcessor could happen in a thread,
such that the client could defer submitting the request to the engine
until the initialization had completed.

This became an issue because recent (Rust-based) Outlines does not
support pickle serialization, but this has been resolved by
dottxt-ai/outlines-core#99.

However, this approach is also not desirable in the case of
XGrammar because the initialization is not expensive (hundreds of
milliseconds) and the serialization is just unnecessary complexity.

And so, let's remove the code from the client side of MQLLMEngine
to special-case the creation of logits_processors based on guided
decoding params. This will now happen on the engine side once
again.

Signed-off-by: Mark McLoughlin <[email protected]>

commit aba9688
Author: Russell Bryant [email protected]
Date: Mon Dec 2 17:58:16 2024 +0000

xgrammar: run grammar compilation async

This runs xgrammar compilation async in a way that's not very
invasive. Compilation will get kicked off when the logit processor is
first created, but blocking on its completion will not occur until
later, the first time it's needed.

Signed-off-by: Russell Bryant <[email protected]>
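As a rough illustration of the change described in commit 0d07f68 above, a minimal sketch follows: instead of pickling fully constructed LogitsProcessors on the client and sending them over ZeroMQ, only the lightweight guided-decoding parameters cross the wire and the engine builds the processor locally. The names here (`GuidedDecodingParams`, `build_logits_processor`) are hypothetical stand-ins, not vLLM's actual classes.

```python
import pickle
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GuidedDecodingParams:
    # Hypothetical stand-in for the guided decoding parameters object.
    json_schema: Optional[str] = None
    regex: Optional[str] = None


def build_logits_processor(params: GuidedDecodingParams) -> Callable:
    """Engine-side construction; in vLLM this is where the xgrammar /
    outlines logits processor would actually be compiled."""
    def processor(token_ids: List[int], logits):
        return logits  # no-op placeholder for the real grammar mask
    return processor


# Client side: serialize only the lightweight parameters ...
wire_payload = pickle.dumps(GuidedDecodingParams(json_schema='{"type": "object"}'))

# ... engine side: deserialize the parameters and build the processor locally,
# instead of unpickling LogitsProcessors constructed on the client.
params = pickle.loads(wire_payload)
logits_processor = build_logits_processor(params)
```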

markmc and others added 2 commits December 3, 2024 17:56

github-actions bot commented Dec 3, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mgoin (Member) commented Dec 3, 2024

This makes quite a large difference on Llama-3.1-8B-Instruct on H100:

python benchmark_guided.py --model meta-llama/Llama-3.1-8B-Instruct --dataset xgrammar_bench --async-engine --output-len 512 --num-prompts 32 --enable-chunked-prefill --guided-decoding-ratio 1

# Main @ 2f2cdc745a7a569637c58cfd5f6789c1d0741c84
Throughput: 2.66 requests/s, 2189.98 total tokens/s, 1363.66 output tokens/s Correct rate is 93.75 % 
First token latency(msecs):
count      32.000000
mean     4076.612419
std       707.871948
min      3046.303682
25%      3654.368758
50%      4330.442839
75%      4826.109441
max      4862.703992
dtype: float64
Next token latency(msecs):
count    32.000000
mean     15.479507
std       1.348600
min      13.995258
25%      14.047387
50%      14.993054
75%      16.284354
max      17.443944
dtype: float64

# This PR
Throughput: 4.19 requests/s, 3442.13 total tokens/s, 2143.35 output tokens/s Correct rate is 93.75 % 
First token latency(msecs):
count      32.000000
mean     1104.019488
std        64.567038
min      1012.124414
25%      1066.139331
50%      1119.639536
75%      1171.954522
max      1184.854695
dtype: float64
Next token latency(msecs):
count    32.000000
mean     12.748215
std       0.093481
min      12.639430
25%      12.645627
50%      12.726120
75%      12.803509
max      12.883286
dtype: float64

@russellb (Collaborator, Author) commented Dec 4, 2024

While early results looked very promising, these changes no longer seemed to help once tested with other, simpler optimizations in place. Using the online benchmark script from #10880, we can't justify the level of change introduced here.

Part of the benefit here was the cleanup: removing the need to serialize the logits processors and pass the extra data around. We can continue pursuing further cleanup and optimization in V1.

@russellb russellb closed this Dec 4, 2024