[Core] Do async init of xgrammar in the engine #10871
Closed
This PR does two key things:
1. Remove logits processor initialization from the client side and move it back
   to its original spot within the engine. This avoids having to serialize the
   logits processors and send them over zmq to the worker.
2. Perform the xgrammar initialization asynchronously in a thread. XGrammar
   releases the Python GIL during this process, so the worker continues in
   parallel. This allows the first forward pass to proceed while initialization
   occurs.
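
The overlap described above can be illustrated with a minimal sketch. It is not the code from this PR: `compile_grammar`, `AsyncGrammarProcessor`, `forward_pass`, and `run_first_step` are hypothetical names, and `time.sleep` stands in for both the grammar compilation and the model forward pass (like the native compilation is said to do, `time.sleep` releases the GIL). The idea is to start compilation in a background thread when the processor is created and to block on the result only when the processor is first called.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for grammar compilation. time.sleep releases the GIL,
# so another thread can run Python code while this call is in progress.
def compile_grammar(schema):
    time.sleep(0.05)
    return {"schema": schema, "compiled": True}


class AsyncGrammarProcessor:
    """Hypothetical logits processor that compiles its grammar in a thread."""

    # Shared pool so that each request does not spawn its own thread pool.
    _executor = ThreadPoolExecutor(max_workers=2)

    def __init__(self, schema):
        # Kick off compilation immediately; do not wait for it here.
        self._future = self._executor.submit(compile_grammar, schema)

    def __call__(self, logits):
        # Blocks only if compilation has not finished by the time the
        # first set of logits is available.
        grammar = self._future.result()
        assert grammar["compiled"]
        # Placeholder: a real processor would mask disallowed tokens here.
        return logits


# Hypothetical stand-in for the first forward pass of the model.
def forward_pass():
    time.sleep(0.05)
    return [0.1, 0.2, 0.7]


def run_first_step(schema):
    start = time.perf_counter()
    processor = AsyncGrammarProcessor(schema)  # compilation starts here
    logits = forward_pass()                    # runs while compilation proceeds
    out = processor(logits)                    # waits for compilation if needed
    return out, time.perf_counter() - start
```

Under these assumptions, `run_first_step('{"type": "object"}')` should take roughly as long as the slower of the two steps rather than their sum, because the wait on the future happens only after the forward pass has produced logits.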
This is a draft until PR #10576 is complete. That PR updates outlines to a new
version that is much more performant. Without it, performance with outlines
will be unacceptable with these changes.
Here are some test results on a system with NVIDIA L4 GPUs from before and after
these changes.
0d07f68 MQ engine: remove guided decoding init from the client
aba9688 xgrammar: run grammar compilation async
commit 0d07f68
Author: Mark McLoughlin [email protected]
Date: Mon Nov 25 13:59:04 2024 -0500
commit aba9688
Author: Russell Bryant [email protected]
Date: Mon Dec 2 17:58:16 2024 +0000