[Kernel] Add CUTLASS sparse support, heuristics, and torch operators #10340

Faraz9877 · 2024-11-14T20:41:13Z

Implements NVIDIA CUTLASS 2:4 structured sparsity support in vLLM for LLM inference. This sparsity pattern keeps at most 2 non-zero weights in every contiguous group of 4 weights in the model.

Changes

Added Cutlass sparse GEMM kernels
Implemented weight matrix conversion to 2:4 sparse format
Modified model loading pipeline to handle sparse weights
Added sparsity configuration options

github-actions · 2024-11-14T20:41:26Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

tlrmchlsmth · 2024-11-14T20:46:12Z

CMakeLists.txt


  FetchContent_Declare(
        cutlass
        GIT_REPOSITORY https://github.com/nvidia/cutlass.git
-        GIT_TAG v3.5.1
+        GIT_TAG be692b48b01620eedabeef8325df5d4eeed6c2ae


Since GIT_TAG now points to a commit hash rather than a tag or branch name, we cannot use GIT_SHALLOW -- see comment a few lines down
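For illustration, a minimal sketch of what a FetchContent declaration pinned to a commit hash could look like with shallow cloning turned off. The repository URL and hash come from the diff above; the GIT_PROGRESS setting and the FetchContent_MakeAvailable call are assumptions about the surrounding file, not part of this PR's diff.

```cmake
include(FetchContent)

FetchContent_Declare(
  cutlass
  GIT_REPOSITORY https://github.com/nvidia/cutlass.git
  # Pinned to a specific commit hash instead of a release tag.
  GIT_TAG be692b48b01620eedabeef8325df5d4eeed6c2ae
  GIT_PROGRESS TRUE  # assumption: shown only to make the sketch complete

  # A shallow clone (--depth 1) can only resolve branch names and tags,
  # so it has to be disabled whenever GIT_TAG is a commit hash.
  GIT_SHALLOW FALSE
)
FetchContent_MakeAvailable(cutlass)
```

If GIT_TAG is later moved back to a release tag, GIT_SHALLOW can be set to TRUE again to speed up the download.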

tlrmchlsmth · 2024-11-14T20:49:25Z

CMakeLists.txt

+  #
+  # The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
+  # CUDA 12.0 or later (and only work on Hopper, 9.0/9.0a for now).
+  cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0;9.0a" "${CUDA_ARCHS}")
+  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
+    set(SRCS "csrc/sparse/cutlass/sparse_compressor.cu")
+    set_gencode_flags_for_srcs(
+      SRCS "${SRCS}"
+      CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
+    list(APPEND VLLM_EXT_SRC "${SRCS}")
+    list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_C3X=1")
+    message(STATUS "Building test_util for archs: ${SCALED_MM_3X_ARCHS}")
+  else()
+    if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
+      message(STATUS "Not building test_util as CUDA Compiler version is "
+                     "not >= 12.0, we recommend upgrading to CUDA 12.0 or "
+                     "later if you intend on running FP8 quantized models on "
+                     "Hopper.")
+    else()
+      message(STATUS "Not building test_util as no compatible archs found "
+                     "in CUDA target architectures")
+    endif()
+
+    # clear SCALED_MM_3X_ARCHS so the scaled_mm_c2x kernels know we didn't 
+    # build any 3x kernels
+    set(SCALED_MM_3X_ARCHS)
+  endif()


Instead of copying this block of code, we should be able to add the new files to the SRCS list defined on line 256:

set(SRCS "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu" "csrc/sparse/cutlass/sparse_compressor.cu" "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
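A sketch of how the consolidated block might look, assuming the existing scaled_mm_c3x block keeps the same helper calls that appear in the copied diff. Note that CMake separates list items with whitespace, not commas. SCALED_MM_3X_ARCHS is assumed to have been computed earlier by cuda_archs_loose_intersection, as in the diff.

```cmake
# One source list for every CUTLASS 3.x (Hopper) kernel, dense and sparse.
set(SRCS
  "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu"
  "csrc/sparse/cutlass/sparse_compressor.cu"
  "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")

# Apply the Hopper gencode flags once to the whole list, then register it.
set_gencode_flags_for_srcs(
  SRCS "${SRCS}"
  CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
```

Keeping a single list means the CUDA version check, the status messages, and the ENABLE_SCALED_MM_C3X flag are defined in one place rather than duplicated for the sparse kernels.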

tlrmchlsmth · 2024-11-14T20:51:22Z

nm_cutlass_c.cmake

is this file needed?

mergify · 2024-11-18T20:04:04Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Faraz9877.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Faraz9877 added 12 commits October 22, 2024 15:49

Add cutlass 2:4 infrastructure

5d51361

Update with test code

17f5b96

Clean up a bit; both fp8 and int8 working

471a03c

Add fp16 and bf16 support to sparse cutlass mm

0b332fb

Add multiprocessing for kernel sweep benchmarking

ccadad0

Add multi-GPU

807737c

Add cutlass_scaled_sparse_mm op

04c19a5

Clean up

2a85c5a

Update code

1b381c9

Update code

4e31076

Clean up the benchmarking

13fccf4

Clean up the cutlass benchmarking

b345cc8

Faraz9877 requested review from tlrmchlsmth and WoosukKwon as code owners November 14, 2024 20:41

mergify bot added the ci/build label Nov 14, 2024

tlrmchlsmth reviewed Nov 14, 2024

Faraz9877 added 4 commits November 14, 2024 23:05

Fix cmake errors

2d03e1d

Fix the cmake TAG

e9439cc

Fix batch size and zeros issue

6870093

Enable other datatypes

8d7b0df

mergify bot added the needs-rebase label Nov 18, 2024

Faraz9877 added 5 commits November 19, 2024 05:40

Add the heuristics for fp8

cdb213d

Add cherry-picked heuristic for Llama3 8B model

246c8ab

Add cherry-picked configs for all data types

4bc043a

Add scaled_sparse_azp op

6773df8

Revert fp8 kernels to best performing version at: 246c8ab

07defd0
