[Bugfix] Use .clone() for sampling params and deepcopy XGrammarLogitsProcessor #11380
+13 −12
XGrammar is now the default decoding backend, but it breaks when doing parallel decoding with `n>1` in Server mode. There is internal state in `self.prefilled` and per-sequence state in the matchers, but parallel decoding uses the same `XGrammarLogitsProcessor` instance for every parallel sequence.
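A toy reproduction of the failure mode described above (a deliberately simplified processor, not the actual vLLM code):

```python
class SharedStateProcessor:
    """Old pattern: one instance and one prefilled flag shared by all parallel sequences."""

    def __init__(self) -> None:
        self.prefilled = False

    def __call__(self, input_ids: tuple) -> None:
        if not self.prefilled:
            # First call (prefill) has no sampled token yet, so skip it once.
            self.prefilled = True
        else:
            # Later calls assume this sequence already produced a token.
            _ = input_ids[-1]


proc = SharedStateProcessor()
proc(())  # sequence 0 prefill: flips the shared flag
try:
    proc(())  # sequence 1 prefill: the shared flag is already set
except IndexError as exc:
    print(f"IndexError: {exc}")  # tuple index out of range
```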
The fix here for `prefilled` is just to remove the flag and use the length of `input_ids` as the check, which avoids an `IndexError: tuple index out of range`. The fix for the matcher state is to deepcopy the processor for each sequence instead of sharing the reference, which prevents an `AssertionError` at `assert self.matchers[i].accept_token(sampled_token)`.
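A sketch of the fixed flow under the same toy setup (the matcher below is a stand-in, not the real xgrammar API, and all names are illustrative):

```python
import copy


class ToyMatcher:
    """Stand-in for a grammar matcher; it only records accepted tokens."""

    def __init__(self) -> None:
        self.accepted: list[int] = []

    def accept_token(self, token: int) -> bool:
        self.accepted.append(token)
        return True


class FixedProcessor:
    """No prefilled flag: the length of input_ids distinguishes prefill from decode."""

    def __init__(self) -> None:
        self.matchers = [ToyMatcher()]

    def __call__(self, input_ids: tuple) -> None:
        if len(input_ids) > 0:
            # A non-empty input_ids means a token was already sampled for this sequence.
            sampled_token = input_ids[-1]
            assert self.matchers[0].accept_token(sampled_token)
        # ... apply the grammar bitmask to the logits here ...


# One deep copy per parallel sequence, so matcher state is never shared.
template = FixedProcessor()
processors = [copy.deepcopy(template) for _ in range(4)]  # e.g. n=4
processors[0](())    # prefill: nothing to accept yet
processors[0]((7,))  # decode: only this sequence's matcher sees token 7
```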
NOTE: I noticed the `batch_size` field on the processor and the creation of multiple matchers based on it, but those don't seem to be used (i.e. `batch_size == 1` even if `n > 1`). I'm not sure if the "batch" was intended to handle parallel decoding instead of doing a deepcopy like I do in this fix, but I also didn't see a good way to index into the batch based on the sequence in the sequence group.

DRAFT: Looking to add a test that would have caught this, and to understand the differences between LP (logits processor) processing for server mode vs. offline mode.
FIX #11312