[Bugfix] Use .clone() for sampling params and deepcopy XGrammarLogitsProcessor #11380
+13 −12
XGrammar is now the default decoding backend, but it breaks when doing parallel decoding with `n>1` in Server mode. There is internal state in `self.prefilled` and per-sequence state in the matchers, but parallel decoding uses the same `XGrammarLogitsProcessor` instance for every parallel sequence.
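A toy reproduction of the failure mode described above (a deliberately simplified processor, not the actual vLLM code):

```python
class SharedStateProcessor:
    """Old pattern: one instance and one prefilled flag shared by all parallel sequences."""

    def __init__(self) -> None:
        self.prefilled = False

    def __call__(self, input_ids: tuple) -> None:
        if not self.prefilled:
            # First call (prefill) has no sampled token yet, so skip it once.
            self.prefilled = True
        else:
            # Later calls assume this sequence already produced a token.
            _ = input_ids[-1]


proc = SharedStateProcessor()
proc(())  # sequence 0 prefill: flips the shared flag
try:
    proc(())  # sequence 1 prefill: the shared flag is already set
except IndexError as exc:
    print(f"IndexError: {exc}")  # tuple index out of range
```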
The fix here for `prefilled` is just to remove the flag and use the length of `input_ids` as the check, which avoids an `IndexError: tuple index out of range`. The fix for the matcher state is to deepcopy the processor for each sequence instead of sharing the reference, which prevents an `AssertionError` at `assert self.matchers[i].accept_token(sampled_token)`.
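A sketch of the fixed flow under the same toy setup (the matcher below is a stand-in, not the real xgrammar API, and all names are illustrative):

```python
import copy


class ToyMatcher:
    """Stand-in for a grammar matcher; it only records accepted tokens."""

    def __init__(self) -> None:
        self.accepted: list[int] = []

    def accept_token(self, token: int) -> bool:
        self.accepted.append(token)
        return True


class FixedProcessor:
    """No prefilled flag: the length of input_ids distinguishes prefill from decode."""

    def __init__(self) -> None:
        self.matchers = [ToyMatcher()]

    def __call__(self, input_ids: tuple) -> None:
        if len(input_ids) > 0:
            # A non-empty input_ids means a token was already sampled for this sequence.
            sampled_token = input_ids[-1]
            assert self.matchers[0].accept_token(sampled_token)
        # ... apply the grammar bitmask to the logits here ...


# One deep copy per parallel sequence, so matcher state is never shared.
template = FixedProcessor()
processors = [copy.deepcopy(template) for _ in range(4)]  # e.g. n=4
processors[0](())    # prefill: nothing to accept yet
processors[0]((7,))  # decode: only this sequence's matcher sees token 7
```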
NOTE: I noticed the `batch_size` field on the processor and the creation of multiple matchers based on it, but those don't seem to be used (i.e. `batch_size == 1` even if `n > 1`). I'm not sure if the "batch" was intended to handle parallel decoding instead of doing a deepcopy like I do in this fix, but I also didn't see a good way to index into the batch based on the sequence in the sequence group.

DRAFT: Looking to add a test that would have caught this, and to understand the differences between LP (logits processor) processing for server mode vs. offline mode.
FIX #11312