[Kernel] Add CUTLASS sparse support with argument sweep, heuristics, and torch operators #10335

Faraz9877 · 2024-11-14T15:57:15Z

Add CUTLASS 2:4 Sparsity Support for Faster LLM Inference

Implements NVIDIA CUTLASS 2:4 structured sparsity support in vLLM for accelerated LLM inference. In this sparsity pattern, only 2 of every 4 consecutive weights are non-zero, which can provide up to a 1.5x speedup while maintaining model quality.

Changes

Added CUTLASS sparse GEMM kernels
Implemented weight matrix conversion to 2:4 sparse format
Modified model loading pipeline to handle sparse weights
Added sparsity configuration options
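
To make the 2:4 pattern concrete, here is a minimal plain-Python sketch of compressing a weight row into kept values plus in-group positions, and expanding it back. It is an illustration only and not the conversion code from this PR: the function names `compress_2_4` and `decompress_2_4` are hypothetical, picking the two largest-magnitude weights per group is an assumed pruning rule, and the real CUTLASS kernels use a packed, hardware-specific metadata layout rather than a Python list of positions.

```python
def compress_2_4(row):
    """Keep 2 of every 4 consecutive weights (hypothetical helper).

    Returns (values, positions): the kept weights and each kept weight's
    index (0..3) inside its group of four.
    Assumption: the two largest-magnitude weights in each group are kept.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, positions = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Rank the four slots by magnitude, largest first (ties keep order).
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        # Keep the top two, stored in their original left-to-right order.
        keep = sorted(by_magnitude[:2])
        values.extend(group[i] for i in keep)
        positions.extend(keep)
    return values, positions


def decompress_2_4(values, positions):
    """Rebuild the dense row, with zeros in the pruned slots."""
    dense = [0.0] * (len(values) * 2)
    for k, (value, pos) in enumerate(zip(values, positions)):
        group_start = (k // 2) * 4  # two kept values per group of four
        dense[group_start + pos] = value
    return dense


row = [0.1, -2.0, 0.3, 4.0, 1.0, 0.0, -0.5, 0.2]
values, positions = compress_2_4(row)
dense = decompress_2_4(values, positions)
```

The compressed form stores half as many weights as the dense row plus a small amount of position metadata, which is the property sparse GEMM kernels rely on to skip the pruned entries.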

github-actions · 2024-11-14T15:57:29Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mgoin · 2024-11-14T17:01:28Z

CMakeLists.txt


  FetchContent_Declare(
        cutlass
        GIT_REPOSITORY https://github.com/nvidia/cutlass.git
-        GIT_TAG v3.5.1
+        GIT_TAG be692b48b01620eedabeef8325df5d4eeed6c2ae


Can you keep a git tag? If you switch to a commit hash, we can't use GIT_SHALLOW TRUE.

Faraz9877 · 2024-11-14

I don't think so, since the CUTLASS 3.6.0 tag is not out yet and I need its features for sparse support.

Faraz9877 added 7 commits October 22, 2024 15:49

Add cutlass 2:4 infrastructure

5d51361

Update with test code

17f5b96

Clean up a bit; both fp8 and int8 working

471a03c

Add fp16 and bf16 support to sparse cutlass mm

0b332fb

Add multiprocessing for kernel sweep benchmarking

ccadad0

Add multi-GPU

807737c

Add cutlass_scaled_sparse_mm op

04c19a5

Faraz9877 requested review from tlrmchlsmth and WoosukKwon as code owners November 14, 2024 15:57

mergify bot added the ci/build label Nov 14, 2024

Clean up

2a85c5a

Faraz9877 closed this Nov 14, 2024

mgoin reviewed Nov 14, 2024
