[V1] Implement Cascade Attention #11635

WoosukKwon · 2024-12-30T16:14:23Z

This PR implements a simple version of cascade attention. Cascade attention reduces the HBM bandwidth needed to read the KV cache when requests share the same prefix.
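
As a rough illustration of the idea (not the kernel code in this PR), attention over the shared prefix and attention over the remaining tokens can be computed separately and then merged using their log-sum-exp values. The sketch below uses plain Python for a single query and a single head, and ignores causal masking. The names `attend`, `merge_states`, and `cascade_attend` are hypothetical and chosen only for this example.

```python
import math

def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-sum-exp of scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def merge_states(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint key sets."""
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)  # relative softmax mass of the first key set
    wb = math.exp(lse_b - m)  # relative softmax mass of the second key set
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]

def cascade_attend(q, prefix_kv, suffix_kv):
    """Attend to the shared prefix and the per-request suffix, then merge."""
    out_p, lse_p = attend(q, *prefix_kv)
    out_s, lse_s = attend(q, *suffix_kv)
    return merge_states(out_p, lse_p, out_s, lse_s)
```

The merged result matches ordinary attention over the concatenated keys and values. In a batched setting the prefix keys and values would be read once for all requests, which is where the bandwidth saving comes from.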

NOTE: For simplicity, this PR only uses cascade attention when every running request shares the same KV cache. If one or more requests do not share the KV cache, cascade attention is not used.
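
The all-or-nothing rule above can be sketched as a check over per-request block tables. This is only an assumption-laden illustration: `num_common_prefix_blocks` and `should_use_cascade` are hypothetical names, the block tables are plain lists of block IDs, and requiring at least two requests is an assumption of this sketch rather than something the PR states.

```python
def num_common_prefix_blocks(block_tables):
    """Count leading KV cache blocks that are identical across all requests."""
    if not block_tables:
        return 0
    count = 0
    # zip stops at the shortest table, so the prefix never exceeds any request.
    for column in zip(*block_tables):
        if any(block_id != column[0] for block_id in column):
            break
        count += 1
    return count

def should_use_cascade(block_tables):
    """Use cascade attention only if every request shares a common prefix.

    Requiring at least two requests is an assumption made for this sketch.
    """
    return len(block_tables) > 1 and num_common_prefix_blocks(block_tables) > 0
```

For example, `should_use_cascade([[1, 2, 3], [1, 2, 4]])` is true because both requests start with blocks 1 and 2, while a single request whose first block differs makes the check fail for the whole batch.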

Signed-off-by: Woosuk Kwon <[email protected]>

github-actions · 2024-12-30T16:14:37Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

comaniac

Overall LGTM. The implementation is pretty clean. One optimization we can do in the future is to add a wrapper CUDA kernel in flash attention to reduce kernel invocation overhead. FlashInfer does this as well.

@raywanb since you've worked on FlashInfer cascade kernel integration in v0, can you also take a look at the current v1 interface and see if this interface can integrate FlashInfer easily in the future?

vllm/v1/core/kv_cache_manager.py

vllm/v1/worker/gpu_model_runner.py

vllm/v1/attention/backends/flash_attn.py

raywanb · 2024-12-31T03:21:12Z

Yes, it seems pretty straightforward to integrate FlashInfer!

comaniac

LGTM. Just two minor questions.
Also, do you have a unit test to verify correctness?

vllm/v1/core/kv_cache_manager.py

vllm/v1/worker/gpu_model_runner.py

comaniac

Thanks for adding this! I'll probably try to benchmark this feature next week if I get a chance.

[V1] Implement Cascade Inference

bf06942

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon requested review from robertgshaw2-neuralmagic, njhill, ywang96, comaniac, alexm-neuralmagic and tlrmchlsmth as code owners December 30, 2024 16:14

mergify bot added the ci/build label Dec 30, 2024

Minor

4faac41

Signed-off-by: Woosuk Kwon <[email protected]>

comaniac reviewed Dec 30, 2024

WoosukKwon added 7 commits December 31, 2024 03:08

Merge branch 'main' into v1-cascade

012775f

Minor

8093b2e

Signed-off-by: Woosuk Kwon <[email protected]>

Merge branch 'main' into v1-cascade

21c988d

isort

2dc2531

Signed-off-by: Woosuk Kwon <[email protected]>

Comment

910752e

Signed-off-by: Woosuk Kwon <[email protected]>

Minor

c8b32de

Signed-off-by: Woosuk Kwon <[email protected]>

minor

42efe0d

Signed-off-by: Woosuk Kwon <[email protected]>

comaniac reviewed Jan 1, 2025

vllm/v1/core/kv_cache_manager.py (outdated, resolved)

vllm/v1/worker/gpu_model_runner.py (outdated, resolved)

WoosukKwon added 10 commits December 31, 2024 18:13

comment

ca7b756

Signed-off-by: Woosuk Kwon <[email protected]>

minor

58af494

Signed-off-by: Woosuk Kwon <[email protected]>

comment

d6a7daf

Signed-off-by: Woosuk Kwon <[email protected]>

Minor

d802be9

Signed-off-by: Woosuk Kwon <[email protected]>

docstring

afe8af7

Signed-off-by: Woosuk Kwon <[email protected]>

comment

3766125

Signed-off-by: Woosuk Kwon <[email protected]>

Fix

801b521

Signed-off-by: Woosuk Kwon <[email protected]>

Consider prefix only

1dfd2d4

Signed-off-by: Woosuk Kwon <[email protected]>

comment

c47a449

Signed-off-by: Woosuk Kwon <[email protected]>

Minor

34da6dd

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon added 3 commits January 1, 2025 00:41

Add debug

03a2809

Signed-off-by: Woosuk Kwon <[email protected]>

Fix

bf94bfa

Signed-off-by: Woosuk Kwon <[email protected]>

Add kernel test

350de8a

Signed-off-by: Woosuk Kwon <[email protected]>

comaniac approved these changes Jan 1, 2025

Add e2e test

8b3291d

Signed-off-by: Woosuk Kwon <[email protected]>

WoosukKwon added the ready label Jan 1, 2025

WoosukKwon merged commit 7300144 into main Jan 1, 2025
89 of 92 checks passed

WoosukKwon deleted the v1-cascade branch January 1, 2025 12:56
