
[Bugfix] Fix for Spec model TP + Chunked Prefill #10232

Merged: 19 commits into vllm-project:main on Nov 26, 2024

Conversation

@andoorve (Collaborator) commented Nov 11, 2024

Fixes the issue I raised here: #9291. Chunked prefill + spec decoding + TP on the spec model fails for me with KeyError: 'num_seq_groups' when I use the following command:

vllm serve meta-llama/Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --max-num-seqs 32  --block-size 32  --speculative-model meta-llama/Llama-3.1-8B-Instruct  --num-speculative-tokens 8 --gpu-memory-utilization  0.98 --use-v2-block-manager --distributed-executor-backend ray --enable-chunked-prefill --max-num-batched-tokens 4096 --max-model-len 32768

This fix makes the proposer run only once on the non-driver processes when no_spec is on, to match the driver.

One thing that is still confusing: I would expect this issue to show up without chunked prefill as well, but it doesn't, and I'm not sure why. It would be good to get an opinion from someone more familiar with the spec decode path.

FIX #10276
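For context, here is a minimal, self-contained sketch (not vLLM's actual worker code; the queue stands in for broadcast_tensor_dict) of the control-flow divergence behind the KeyError: the driver and non-driver ranks must perform the same number of broadcasts per step, otherwise the non-driver reads a later, unrelated payload as execute-model metadata and fails to find 'num_seq_groups'.

from queue import Queue
from typing import Optional

channel: Queue = Queue()  # stand-in for broadcast_tensor_dict


def driver_step(no_spec: bool) -> None:
    # The first broadcast always happens.
    channel.put({"no_spec": no_spec})
    if not no_spec:
        # Only the speculative path sends a second payload with the
        # execute-model metadata (e.g. "num_seq_groups").
        channel.put({"num_seq_groups": 4})


def non_driver_step() -> Optional[dict]:
    data = channel.get()
    if data["no_spec"]:
        # Match the driver: nothing more to read this step, so the ranks
        # stay in lockstep and never misread a later payload.
        return None
    return channel.get()


driver_step(no_spec=True)
print(non_driver_step())   # -> None
driver_step(no_spec=False)
print(non_driver_step())   # -> {'num_seq_groups': 4}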


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@andoorve (Collaborator, Author):

@NickLucche @sroy745

@andoorve force-pushed the andoorve/spec-fix-chunked branch from f1ff8aa to 6863d1f on November 11, 2024 at 20:33
@sroy745 (Collaborator) commented Nov 12, 2024:

Hi,
Thanks for the fix.

Based on our DM discussions, my understanding is that the main issue is that even when all the sequences are prompts (prefill only), num_lookahead_slots is > 0. I added some logs in this PR (https://github.com/vllm-project/vllm/pull/10186/files), and the output when I run with and without chunked prefill enabled is the following:

Without chunked prefill

num_lookahead_slots in _schedule_default 0
prefills in _schedule_default_prefill 1
decodes in _schedule_default_prefill 0

With chunked prefill

num_lookahead_slots in _schedule_chunked_prefill 4
prefills in _schedule_chunked_prefill 1
decodes in _schedule_chunked_prefill 0

In the run without chunked prefill, num_lookahead_slots is set to 0 when the batch is a complete prefill batch, but that is not the case for the chunked-prefill run. I wonder if we should fix _schedule_chunked_prefill to set num_lookahead_slots to 0 when the batch is a complete prefill batch, and add an assertion in spec_decode_worker for that?
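A minimal sketch of this suggestion, using made-up names rather than the actual _schedule_chunked_prefill internals:

from typing import List


def lookahead_slots_for_batch(is_prefill_flags: List[bool],
                              num_lookahead_slots: int) -> int:
    # Complete prefill batch: no speculation happens, so schedule no
    # lookahead slots, mirroring the non-chunked scheduler's behaviour.
    if is_prefill_flags and all(is_prefill_flags):
        return 0
    return num_lookahead_slots


# Matches the logs above: a pure-prefill chunked-prefill batch that would
# otherwise have scheduled 4 lookahead slots.
assert lookahead_slots_for_batch([True], num_lookahead_slots=4) == 0
# A mixed batch keeps its lookahead slots.
assert lookahead_slots_for_batch([True, False], num_lookahead_slots=4) == 4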

@NickLucche (Contributor):

I wonder if we should fix _schedule_chunked_prefill to set num_lookahead_slots to 0 if it is a complete prefill batch and add an assertion in spec_decode_worker for that

I like that; I think this would be more in line with the expected semantics (no speculation on prefill-only batches).

Thanks for looking into it!!

@andoorve force-pushed the andoorve/spec-fix-chunked branch 2 times, most recently from d5f6392 to 10f69a4 on November 12, 2024 at 19:16
@andoorve (Collaborator, Author):

As discussed over DM, moving this up to the scheduler level is a cleaner fix, so I moved the check there. @NickLucche @sroy745 PTAL, and if this logic looks good I'll mark this ready!

@andoorve andoorve self-assigned this Nov 13, 2024
@sroy745 (Collaborator) left a comment:

Thanks for the PR! I added a couple of comments about tests. The logic LGTM.

Thanks

Resolved review threads on vllm/core/scheduler.py and vllm/spec_decode/spec_decode_worker.py.
num_batched_tokens=budget.num_batched_tokens,
blocks_to_swap_in=swapped_in.blocks_to_swap_in,
blocks_to_swap_out=running_scheduled.blocks_to_swap_out,
blocks_to_copy=running_scheduled.blocks_to_copy +
swapped_in.blocks_to_copy,
ignored_seq_groups=prefills.ignored_seq_groups +
swapped_in.infeasible_seq_groups,
- num_lookahead_slots=running_scheduled.num_lookahead_slots,
+ num_lookahead_slots=num_lookahead_slots,
Collaborator:

@varun-sundar-rabindranath could you also review this part to see if this will break multi-step scheduling with chunked prefill?

@varun-sundar-rabindranath (Contributor):

Thanks for the tag. I believe it will affect performance.
Multi-step + chunked prefill allows lookahead slots even when all the sequences are prefills: the sequences are processed as prefills in step 1 and as decodes in steps 2 through n.
Setting the lookahead slots to 0 will force single-stepping for the all-prefills case. I can get some profiles.

@andoorve is there a way to make this update only when spec decode is enabled? I believe that would be safer.
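A rough sketch of the gating being asked for here, assuming an illustrative helper rather than the real condition in vllm/core/scheduler.py:

def chunked_prefill_lookahead_slots(num_lookahead_slots: int,
                                    all_prefill: bool,
                                    spec_decode_enabled: bool) -> int:
    # Only the speculative-decoding path zeroes the slots. Multi-step +
    # chunked prefill does not run spec decode, so its all-prefill batches
    # keep their lookahead slots and multi-stepping is not forced off.
    if all_prefill and spec_decode_enabled:
        return 0
    return num_lookahead_slots


assert chunked_prefill_lookahead_slots(4, all_prefill=True,
                                       spec_decode_enabled=False) == 4
assert chunked_prefill_lookahead_slots(4, all_prefill=True,
                                       spec_decode_enabled=True) == 0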

@andoorve (Collaborator, Author):

Hi @varun-sundar-rabindranath, I think that should be possible, thanks for the feedback! Let me see how we can do that.

@andoorve (Collaborator, Author):

@varun-sundar-rabindranath @comaniac Can you check whether this condition makes sense?

Contributor:

Hey @andoorve - The condition looks good 👍

@mergify bot added the documentation (Improvements or additions to documentation) label on Nov 13, 2024
@andoorve added the bug (Something isn't working) label on Nov 15, 2024
@andoorve force-pushed the andoorve/spec-fix-chunked branch from 5893379 to 0b300d2 on November 15, 2024 at 03:06
@andoorve marked this pull request as ready for review on November 19, 2024 at 19:28
@andoorve (Collaborator, Author):

Waiting on reviews from @sroy745 @varun-sundar-rabindranath @comaniac

@@ -653,6 +659,9 @@ def _run_non_driver_rank(self) -> bool:

if not data["no_spec"]:
self.scorer_worker.execute_model()
data = broadcast_tensor_dict(src=self._driver_rank)
@andoorve (Collaborator, Author):

The extra broadcast is not ideal, but since this is a bugfix and should be low impact perf-wise, we may have to live with it.

Collaborator:

Hi, as discussed offline I will try to run a benchmark with and without this change to measure its impact, if any.

@andoorve (Collaborator, Author) commented Nov 20, 2024:

I tried it with Llama 405B + 8B spec model @ 32k sequence length w/ speculative_tokens = 8 and TP 8 on H100.

The results below are for "prompt_tokens":30,"total_tokens":1054,"completion_tokens":1024 averaged over 3 runs.
Before Change: 9.321 s
After Change: 9.268 s

Speedup: 99.4%

Therefore, it slows down the regular speculative path very slightly.

@andoorve (Collaborator, Author):

cc: @njhill - Can we consider this acceptable as it is a necessary bugfix?

@sroy745 (Collaborator) commented Nov 21, 2024:

Thanks for running the comparison. Following up on our conversation yesterday, I was wondering if we can piggyback on the existing broadcast that we already do. For the chunked-prefill + spec-decode case we need to broadcast the additional information that there are prefills in the speculative batch, so the non-driver rank knows to run the proposer worker. Could we check for prompts in the input batch and, based on that, set 'run_spec_proposer' in the initial broadcast? Looking through the proposer code, it does something similar when deciding which sequences to use for decoding vs. prefill (https://sourcegraph.com/github.com/vllm-project/vllm/-/blob/vllm/spec_decode/top1_proposer.py?L131).

@NickLucche can we do the is_prompt check and use that to set the 'run_spec_proposer' field in the initial broadcast itself?
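A hedged sketch of the piggyback idea: the driver inspects the batch for prompt sequences and ships a single flag inside the payload it already broadcasts, so no extra collective is needed. The field name matches the one eventually merged; the helper names are illustrative, not the exact vLLM API.

from typing import Dict, List


def build_driver_broadcast(is_prompt_flags: List[bool],
                           no_spec: bool) -> Dict[str, object]:
    return {
        "no_spec": no_spec,
        # True if there is at least one prompt in the speculative batch.
        "run_spec_proposer_for_prefill": any(is_prompt_flags),
    }


def needs_proposer_prefill_pass(data: Dict[str, object]) -> bool:
    # The non-driver worker uses the flag to decide whether to run one
    # extra prefill forward pass on the proposer.
    return bool(data.get("run_spec_proposer_for_prefill", False))


payload = build_driver_broadcast(is_prompt_flags=[True, False], no_spec=True)
assert needs_proposer_prefill_pass(payload)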

@andoorve (Collaborator, Author):

I don't know if it's sufficient to check for only prompts specifically. It doesn't go down that path when we try a single request scenario.

@sroy745 (Collaborator) commented Nov 21, 2024:

Thanks for taking a look. Can you please share some more details on what the issue is for the single request case?

@andoorve (Collaborator, Author):

Well, we only needed to add this extra broadcast after trying multiple requests. When we were simply sending single requests, this broadcast never came into play at all. So the condition is not simply "all prompts".

@@ -413,6 +413,39 @@ def cannot_append_second_group2(seq_group, num_lookahead_slots):
assert out.num_batched_tokens == max_num_batched_tokens


def test_chunked_prefill_spec_prefill():
"""Verify preempt works with chunked prefill requests"""
Collaborator:

nit - please update the comment to "Verify that the num_lookahead_slots is set appropriately for an all prefill batch depending on whether multi-step scheduling is enabled or not"?

@andoorve (Collaborator, Author):

Changed

assert out.num_prefill_groups == 1
assert seq_group.is_prefill()
assert out.num_batched_tokens == max_num_batched_tokens
assert out.num_lookahead_slots == 0
Collaborator:

wondering if we can parameterize this test to run for both multi_step = True/False?

@andoorve (Collaborator, Author):

Added
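For reference, a hypothetical shape of such a parameterized test; the helper and the constant below are stand-ins for the real scheduler fixtures in the test file, not the actual vLLM code.

from dataclasses import dataclass

import pytest

NUM_LOOKAHEAD_SLOTS = 4


@dataclass
class SchedulerOutputStub:
    num_lookahead_slots: int


def schedule_all_prefill_batch(multi_step: bool) -> SchedulerOutputStub:
    # Stand-in for building a Scheduler with an all-prefill batch and
    # calling schedule(): multi-step keeps its lookahead slots, while the
    # spec-decode path zeroes them.
    return SchedulerOutputStub(
        num_lookahead_slots=NUM_LOOKAHEAD_SLOTS if multi_step else 0)


@pytest.mark.parametrize("is_multi_step", [True, False])
def test_chunked_prefill_spec_prefill(is_multi_step: bool):
    """Verify num_lookahead_slots is set appropriately for an all-prefill
    batch depending on whether multi-step scheduling is enabled."""
    out = schedule_all_prefill_batch(multi_step=is_multi_step)
    assert out.num_lookahead_slots == (NUM_LOOKAHEAD_SLOTS
                                       if is_multi_step else 0)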

per_test_common_llm_kwargs,
baseline_llm_kwargs, test_llm_kwargs,
batch_size: int, seed: int):
"""Verify spec decode works well with smaller tp for draft models.
Collaborator:

nit - Verify spec decode works well with draft models for tp > 1.

@andoorve (Collaborator, Author):

Changed this, thanks for the catch


# the other for decodes. The variable indicates to the non-driver
# worker that there are prefills as part of the speculative batch
# and hence it needs to run an extra prefill forward pass.
run_spec_proposer_for_prefill=atleast_one_prompt,
@andoorve (Collaborator, Author):

It would be great if we got a sanity check from @NickLucche or someone!

@NickLucche (Contributor):

on it, sorry for the late ack

@sroy745 added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Nov 22, 2024
@njhill (Member) left a comment:

Thanks for the great work @andoorve @sroy745 @NickLucche!

@andoorve (Collaborator, Author):

Rebased; waiting for tests to pass before pushing.

@andoorve (Collaborator, Author):

I think there's a real error here:


[2024-11-25T20:26:10Z]         if all_prompt:
[2024-11-25T20:26:10Z] >           assert num_lookahead_slots == 0, (
[2024-11-25T20:26:10Z]                 "Prompt only runs should have num_lookahead_slots equal to 0. "
[2024-11-25T20:26:10Z]                 "This should never happen, please file a bug at "
[2024-11-25T20:26:10Z]                 "https://github.com/vllm-project/vllm/issues")
[2024-11-25T20:26:10Z] E           AssertionError: Prompt only runs should have num_lookahead_slots equal to 0. This should never happen, please file a bug at https://github.com/vllm-project/vllm/issues

FAILED spec_decode/test_spec_decode_worker.py::test_empty_input_batch[typical_acceptance_sampler-0-5]

@sroy745 maybe this is why the None part was necessary with all_prompt?
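One plausible explanation, sketched below (a guess from the failing test name, not a confirmed root cause): Python's all() over an empty iterable is True, so an empty input batch counts as "all prompt" and trips the assertion even though num_lookahead_slots can be non-zero; guarding on a non-empty batch, or keeping all_prompt as None for empty batches as the earlier version did, avoids it.

def is_all_prompt(seq_group_is_prompt: list) -> bool:
    # all([]) == True, which is the surprising case for an empty batch.
    return bool(seq_group_is_prompt) and all(seq_group_is_prompt)


assert all([]) is True              # empty batch would wrongly count as all-prompt
assert is_all_prompt([]) is False   # the guarded version does not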

andoorve and others added 19 commits on November 26, 2024 at 05:52, including a revert of 6863d1f.
@andoorve force-pushed the andoorve/spec-fix-chunked branch from e59bb79 to 01b43aa on November 26, 2024 at 05:52
@NickLucche (Contributor):

Hey @andoorve, thanks for the quick fix on the last hiccup!

I did take a look at test_empty_input_batch but I couldn't find any real use case where we're sending empty batches to signal or set something. Were you able to find one? It could be useful to note here for reference, imho.

PS: we're still missing the DCO check to get the green light.

@andoorve (Collaborator, Author):

@NickLucche No, I didn't find any - I just included that quick fix based on what @sroy745 did previously.

The DCO failure is on one of @sroy745's commits and he wasn't able to get it to work. I think when we squash and merge it should be good to go, signed off by both of us.

@njhill Would you mind merging when you get a chance?

@njhill merged commit db66e01 into vllm-project:main on Nov 26, 2024 (51 checks passed)
@andoorve deleted the andoorve/spec-fix-chunked branch on November 26, 2024 at 17:28
afeldman-nm pushed a commit to neuralmagic/vllm that referenced this pull request on Nov 26, 2024
afeldman-nm pushed a commit to neuralmagic/vllm that referenced this pull request on Dec 2, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request on Dec 13, 2024
Labels: bug (Something isn't working), documentation (Improvements or additions to documentation), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects: None yet

Successfully merging this pull request may close these issues:

[Bug]: Speculative Decoding + TP on Spec Worker + Chunked Prefill does not work.

6 participants