[Bugfix] Fix for Spec model TP + Chunked Prefill #10232
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from f1ff8aa to 6863d1f
Hi, based on our DM discussions, my understanding is that the main issue is that even when all the sequences are prompts (prefill only), we have num_lookahead_slots > 0. I added some logging in this PR (https://github.com/vllm-project/vllm/pull/10186/files), and the output when I run with and without chunked prefill enabled is the following.
Without chunked prefill:
With chunked prefill:
In the run without chunked prefill, num_lookahead_slots is set to 0 when it is a complete prefill batch, but that is not the case for the chunked-prefill run. I wonder if we should fix _schedule_chunked_prefill to set num_lookahead_slots to 0 if it is a complete prefill batch, and add an assertion for that in spec_decode_worker?
I like that, I think this would be more in line with the expected semantics (no speculation on prefill-only batches). Thanks for looking into it!
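To make the proposal concrete, here is a minimal sketch with hypothetical helper names (is_prefill_only_batch and assert_no_lookahead_for_prefill_only are illustrations, not the actual vLLM scheduler or spec_decode_worker code):

```python
# Sketch with hypothetical names, illustrating the proposed invariant rather
# than the actual vLLM code.

def is_prefill_only_batch(seq_group_metadata_list) -> bool:
    # An all-prefill batch has nothing to speculate on yet.
    return all(m.is_prompt for m in seq_group_metadata_list)


def assert_no_lookahead_for_prefill_only(seq_group_metadata_list,
                                         num_lookahead_slots: int) -> None:
    # The assertion suggested for spec_decode_worker: a prefill-only batch
    # must be scheduled with num_lookahead_slots == 0.
    if is_prefill_only_batch(seq_group_metadata_list):
        assert num_lookahead_slots == 0, (
            "prefill-only batch scheduled with lookahead slots")
```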
Force-pushed from d5f6392 to 10f69a4
As discussed over DM, moving this up to the scheduler level is a cleaner fix, so I moved the check there. @NickLucche @sroy745 PTAL if this logic looks good, then I'll mark this ready!
Thanks for the PR! Added a couple of comments about tests. Logic LGTM
Thanks
    num_batched_tokens=budget.num_batched_tokens,
    blocks_to_swap_in=swapped_in.blocks_to_swap_in,
    blocks_to_swap_out=running_scheduled.blocks_to_swap_out,
    blocks_to_copy=running_scheduled.blocks_to_copy +
        swapped_in.blocks_to_copy,
    ignored_seq_groups=prefills.ignored_seq_groups +
        swapped_in.infeasible_seq_groups,
-   num_lookahead_slots=running_scheduled.num_lookahead_slots,
+   num_lookahead_slots=num_lookahead_slots,
@varun-sundar-rabindranath could you also review this part to see if this will break multi-step scheduling with chunked prefill?
Thanks for the tag. I believe it will affect performance.
Multi-step + chunked prefill allows for lookahead slots even when all the sequences are prefills: the sequences are processed as prefills in step 1 and as decodes in steps 2 through n.
Setting the lookahead slots to 0 will force single-stepping for the all-prefills case. I can get some profiles.
@andoorve is there a way to make this update only if spec decode is enabled? I believe that would be safer.
Hi @varun-sundar-rabindranath, I think that should be possible, thanks for the feedback! Let me see how we can do that.
@varun-sundar-rabindranath @comaniac Can you check whether this condition makes sense?
Hey @andoorve - The condition looks good 👍
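For reference, a hedged sketch of what the gated condition could look like; the flags passed in (spec_decode_enabled, is_multi_step) are assumptions for illustration and not necessarily how the scheduler config exposes them:

```python
def pick_num_lookahead_slots(num_scheduled_groups: int,
                             num_prefill_groups: int,
                             default_lookahead_slots: int,
                             spec_decode_enabled: bool,
                             is_multi_step: bool) -> int:
    """Zero the lookahead slots only for an all-prefill batch when spec
    decode is enabled and multi-step scheduling is off; multi-step keeps
    its lookahead slots so steps 2..n of a prefill batch can run as
    decodes."""
    all_prefills = num_scheduled_groups == num_prefill_groups
    if all_prefills and spec_decode_enabled and not is_multi_step:
        return 0
    return default_lookahead_slots
```

The intent is that multi-step + chunked prefill keeps its lookahead slots, while the spec-decode path sees zero lookahead slots for a prefill-only batch and takes the no_spec path.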
Force-pushed from 5893379 to 0b300d2
Waiting on reviews from @sroy745 @varun-sundar-rabindranath @comaniac
@@ -653,6 +659,9 @@ def _run_non_driver_rank(self) -> bool:

        if not data["no_spec"]:
            self.scorer_worker.execute_model()
            data = broadcast_tensor_dict(src=self._driver_rank)
Extra broadcast is not ideal, but since this is a bugfix and should be low-impact perf-wise, we may have to live with it.
Hi, as discussed offline, I will try to run a benchmark with and without this change to measure the impact, if any.
I tried it with Llama 405B + an 8B spec model at 32k sequence length, with speculative_tokens = 8 and TP 8 on H100.
The results below are for "prompt_tokens": 30, "total_tokens": 1054, "completion_tokens": 1024, averaged over 3 runs.
Before Change: 9.321 s
After Change: 9.268 s
Speedup: 99.4%
Therefore, it slows down the regular speculative path very slightly.
cc: @njhill - Can we consider this acceptable as it is a necessary bugfix?
Thanks for running the comparison. Following up on our conversation yesterday, I was wondering if we can piggyback on the existing broadcast that we already do. For the chunked-prefill + spec-decode case, we need to broadcast the additional information that there are prefills in the speculative batch and, if so, run the proposer worker. For that, I was wondering if we can check for prompts in the input batch and, based on that, set 'run_spec_proposer' in the initial broadcast. Looking through the proposer code, I think it's doing something similar when deciding which sequences to use for decoding vs. prefill (https://sourcegraph.com/github.com/vllm-project/vllm/-/blob/vllm/spec_decode/top1_proposer.py?L131).
@NickLucche can we do the is_prompt check and use that to set the 'run_spec_proposer' field in the initial broadcast itself?
I don't know if it's sufficient to check for only prompts specifically. It doesn't go down that path when we try a single request scenario.
Thanks for taking a look. Can you please share some more details on what the issue is for the single request case?
Well, we only needed to add this extra broadcast after trying multiple requests. When we were simply sending single requests, this broadcast never came into play at all. So the condition is not simply "all prompts".
@@ -413,6 +413,39 @@ def cannot_append_second_group2(seq_group, num_lookahead_slots):
    assert out.num_batched_tokens == max_num_batched_tokens


def test_chunked_prefill_spec_prefill():
    """Verify preempt works with chunked prefill requests"""
nit - please update the comment to "Verify that the num_lookahead_slots is set appropriately for an all prefill batch depending on whether multi-step scheduling is enabled or not"?
Changed
    assert out.num_prefill_groups == 1
    assert seq_group.is_prefill()
    assert out.num_batched_tokens == max_num_batched_tokens
    assert out.num_lookahead_slots == 0
wondering if we can parameterize this test to run for both multi_step = True/False?
Added
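For readers following along, a sketch of the parameterization pattern being requested, assuming pytest; the real test builds a Scheduler and schedules an all-prefill batch, which is reduced here to a stand-in helper:

```python
import pytest


def _expected_lookahead(all_prefills: bool, is_multi_step: bool,
                        configured_slots: int) -> int:
    # Stand-in for the scheduler behavior under test (illustration only).
    return 0 if (all_prefills and not is_multi_step) else configured_slots


@pytest.mark.parametrize("is_multi_step", [True, False])
def test_chunked_prefill_spec_prefill(is_multi_step: bool):
    """Verify that num_lookahead_slots is set appropriately for an
    all-prefill batch depending on whether multi-step scheduling is
    enabled or not."""
    configured_slots = 4
    # With multi-step the all-prefill batch keeps its lookahead slots;
    # otherwise they are zeroed so spec decode takes the no_spec path.
    assert _expected_lookahead(True, is_multi_step, configured_slots) == (
        configured_slots if is_multi_step else 0)
```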
                                 per_test_common_llm_kwargs,
                                 baseline_llm_kwargs, test_llm_kwargs,
                                 batch_size: int, seed: int):
    """Verify spec decode works well with smaller tp for draft models.
nit - Verify spec decode works well with draft models for tp > 1.
Changed this, thanks for the catch
    # the other for decodes. The variable indicates to the non-driver
    # worker that there are prefills as part of the speculative batch
    # and hence it needs to run an extra prefill forward pass.
    run_spec_proposer_for_prefill=atleast_one_prompt,
It would be great if we got a sanity check from @NickLucche or someone!
on it, sorry for the late ack
Thanks for the great work @andoorve @sroy745 @NickLucche!
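Pulling the pieces together, a hedged sketch of the flow under review; the dict keys, the is_prompt check, and the worker method names are assumptions for illustration (and the ordering of scorer/proposer passes is simplified), not the exact vLLM interfaces:

```python
# Sketch only: the flag rides on the existing per-step broadcast.

def driver_broadcast_payload(seq_group_metadata_list, no_spec: bool) -> dict:
    # The driver already broadcasts one dict per step; the new field is added
    # to that same dict instead of issuing a second broadcast.
    atleast_one_prompt = any(m.is_prompt for m in seq_group_metadata_list)
    return {
        "no_spec": no_spec,
        "run_spec_proposer_for_prefill": atleast_one_prompt,
    }


def run_non_driver_rank(data: dict, scorer_worker, proposer_worker) -> None:
    # Non-driver TP ranks must issue the same forward passes as the driver so
    # that collective ops stay in lockstep across ranks.
    if not data["no_spec"]:
        scorer_worker.execute_model()
    if data["run_spec_proposer_for_prefill"]:
        # Prefills in the speculative batch mean the proposer needs one extra
        # prefill forward pass on this rank as well.
        proposer_worker.execute_model()
```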
Rebased, waiting for tests to pass to push.
I think there's a real error here:
@sroy745 maybe this is why the None part was necessary with all_prompt?
Force-pushed from e59bb79 to 01b43aa
Hey @andoorve, thanks for the quick fix on the last hiccup! I did take a look at it. PS: we're still missing the DCO check to get the green lights.
@NickLucche No, I didn't find any - I just included that quick fix based on what @sroy745 did previously. DCO is on one of @sroy745's commits, but he wasn't able to get it to work. I think when we squash and merge, though, it should be good to go and signed off by both. @njhill would you mind merging when you get a chance?
Fixes the issue I raised here: #9291. Chunked prefill + spec decoding + TP on the spec model fails for me with KeyError: 'num_seq_groups' when I used the following command. This fix makes it so the proposer only runs once on the non-driver processes when no_spec is on, to match the driver.
One thing that is still confusing: I would expect this issue to show up without chunked prefill as well, and I'm unsure why it doesn't in that case. It would be good to get an opinion from someone more familiar with the spec decode path.
FIX #10276
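For intuition only, a toy model (not vLLM code) of why one extra broadcast receive on a non-driver rank surfaces as KeyError: 'num_seq_groups': each receive consumes the next dict the driver sent, so a desynchronized rank ends up reading a dict that was meant for a different phase and lacks the expected key.

```python
from collections import deque

channel: deque = deque()  # stand-in for broadcast_tensor_dict traffic


def driver_step() -> None:
    channel.append({"no_spec": False})      # spec-decode control dict
    channel.append({"num_seq_groups": 2})   # model-input dict


def worker_step(extra_model_pass: bool) -> None:
    control = channel.popleft()
    passes = 2 if (not control["no_spec"] and extra_model_pass) else 1
    for _ in range(passes):
        data = channel.popleft()
        # On the second iteration this pops the NEXT step's control dict,
        # which has no 'num_seq_groups' entry.
        _ = data["num_seq_groups"]


driver_step()
driver_step()
try:
    worker_step(extra_model_pass=True)
except KeyError as err:
    print(f"desynchronized receive -> KeyError: {err}")
```

The fix described above removes the extra pass on non-driver processes when no_spec is on, so the number of receives matches the driver's sends.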