
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support #10995

Merged
merged 102 commits into vllm-project:main on Dec 18, 2024

Conversation

@dsikka (Contributor) commented Dec 8, 2024

Summary

  • Add sparse quantized and unquantized kernels for CUTLASS 3.x.
  • Add compressed-tensors support for 2:4 sparse-only and 2:4 sparse + INT8/FP8 quantized models.

From Neural Magic

scale_a = torch.randn((1, 1), device="cuda", dtype=torch.float32) / 10
scale_b = torch.randn((1, 1), device="cuda", dtype=torch.float32) / 10

print("in test")
Contributor:

nit: remove cruft


at::cuda::OptionalCUDAGuard const device_guard(device_of(a));
int32_t version_num = test_get_sm_version_num();
// Hopper
Contributor:

nit: what's this comment for?

Contributor:

maybe for a future PR, but there should be more tests here: test more shapes, add an opcheck test (see test_cutlass_support_opcheck) and a CUDA graph test (see test_cutlass_cuda_graph). Use vllm/tests/kernels/test_cutlass.py as inspiration (with the exception of the azp stuff, I assume)

@@ -361,7 +361,8 @@ def main(args: argparse.Namespace):
# TODO(vllm-project/vllm/issues/9778): Count multi-modal token length.
print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
f"{total_output_tokens / elapsed_time:.2f} output tokens/s")
f"{total_output_tokens / elapsed_time:.2f} output tokens/s, "
f"{total_num_tokens=} | {total_output_tokens=}")
Collaborator:

This looks like debug cruft and should be reverted if so

Contributor:

Done

Comment on lines 17 to 20
inline uint32_t next_pow_2(uint32_t const num) {
if (num <= 1) return num;
return 1 << (CHAR_BIT * sizeof(num) - __builtin_clz(num - 1));
}
Collaborator:

could you put this in csrc/core/math.hpp? @SageMoore is adding similar utilities to that file in #10867

Contributor:

Done

Comment on lines 25 to 33
#define CUDA_CHECK(status) \
{ \
cudaError_t error = status; \
if (error != cudaSuccess) { \
std::cerr << "Got bad cuda status: " << cudaGetErrorString(error) \
<< " at line: " << __LINE__ << std::endl; \
exit(EXIT_FAILURE); \
} \
}
Collaborator:

We should throw an exception here, and it should behave generally the same way that CUTLASS_CHECK does.
(I do like the line number reporting though, so it would be nice if you could add it to both)

Suggested change
#define CUDA_CHECK(status) \
{ \
cudaError_t error = status; \
if (error != cudaSuccess) { \
std::cerr << "Got bad cuda status: " << cudaGetErrorString(error) \
<< " at line: " << __LINE__ << std::endl; \
exit(EXIT_FAILURE); \
} \
}
#define CUDA_CHECK(status) \
{ \
TORCH_CHECK(status == cudaSuccess, \
cudaGetErrorString(status)) \
}

Contributor:

Done
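
(As an aside, a minimal sketch of how the line-number reporting could be kept in the TORCH_CHECK-based form, assuming the standard __FILE__/__LINE__ macros; this is illustrative, not the exact macro that was merged:)

// Sketch only: evaluate the expression once, then report file/line on failure.
#define CUDA_CHECK(EXPR)                                      \
  do {                                                        \
    cudaError_t cuda_check_err_ = (EXPR);                     \
    TORCH_CHECK(cuda_check_err_ == cudaSuccess,               \
                cudaGetErrorString(cuda_check_err_),          \
                " at ", __FILE__, ":", __LINE__);             \
  } while (0)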

Comment on lines 1 to 43
#include <cudaTypedefs.h>

#include <torch/all.h>

#include <ATen/cuda/CUDAContext.h>

#include <iostream>
#include <sstream>
#include <vector>

#include "cutlass/cutlass.h"

#include "cute/tensor.hpp"
#include "cute/atom/mma_atom.hpp"
#include "cutlass/numeric_types.h"
#include "cutlass/numeric_conversion.h"
#include "cutlass/detail/dependent_false.hpp"

#include "cutlass_extensions/epilogue/broadcast_load_epilogue_c3x.hpp"
#include "cutlass_extensions/common.hpp"

#include "cutlass/transform/device/transform_universal_adapter.hpp"
#include "cutlass/transform/kernel/sparse_gemm_compressor.hpp"

#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"

#include <iostream>

#include "cutlass/cutlass.h"

#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/dispatch_policy.hpp"

#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"

#include "cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp"
#include "sparse_scaled_mm_c3x.cuh"
Collaborator:

Please clean up these includes. I see some duplicates. Could you try to minimize the number of includes, i.e. no duplicates and nothing that's unnecessary?

Also please turn clang-format off for the includes, as CUTLASS headers don't tolerate reordering.

// clang-format will break include orders
// clang-format off

#include "your.h"
#include "includes.h"
#include "here.h"

// clang-format on

Contributor:

Done

Collaborator:

These should be pared down further.

For example:
"cutlass_extensions/epilogue/scaled_mm_epilogues_c3x.hpp" already includes "cutlass_extensions/epilogue/broadcast_load_epilogue_c3x.hpp" and most of our CUTLASS kernels don't interact directly with the code in broadcast_load_epilogue_c3x.hpp so they should only include scaled_mm_epilogues_c3x.hpp.

However this sparsify_and_compress kernel doesn't use any epilogues at all so it shouldn't include either of them.

Could you take another look at these includes and the includes in your other kernels as well?

Contributor:

Done. CUTLASS's CompressorUtility requires a Gemm to be defined with all operand types, schedules, etc., including an epilogue (albeit the default one). I had previously used my default Gemm config with ScaledEpilogue for this Gemm, but per this review I replaced it with an on-the-spot Gemm kernel setup similar to the examples provided in CUTLASS. I now mention this in a comment in the code as well.

Comment on lines 76 to 77
// Just a dummy value
int32_t n = 128;
Collaborator:

Could you expand on this comment?

Contributor:

It was just needed to instantiate a problem shape to use the compressor utility in CUTLASS. I replaced it with 1 in the problem shape directly.

Collaborator:

Please put this in a comment in the code so that it is documented there

Contributor:

Done
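
(A hedged sketch of how the documented problem shape might look, reusing the ProblemShape pattern quoted elsewhere in this PR; names and the exact form are illustrative:)

// N is irrelevant to compression; 1 is used only so that a valid GEMM
// problem shape can be constructed for CUTLASS's CompressorUtility.
typename Gemm::GemmKernel::ProblemShape prob_shape{m, 1, k, 1};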

Comment on lines 56 to 57
// Check for strides and alignment
TORCH_CHECK(a.stride(1) == 1)
Collaborator:

is there any requirement for the divisibility of a.stride(0)? Do we test odd values of m?

Contributor:

No. Since we're doing column-major output in the kernels, there's no requirement. For row-major output, the batch size has to be a multiple of 8.

Collaborator:

I thought this was the weight matrix, so batch isn't relevant here

Contributor:

You're right, my bad for misunderstanding. The intermediate dimension of the matmul should be divisible by 4 to be able to follow the 2:4 sparsity. So a.stride(0) % 4 == 0 must hold. I added a check for this divisibility.
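
(A minimal sketch of the added checks, following the TORCH_CHECK style quoted above; the exact merged wording may differ:)

// Check for strides and alignment
TORCH_CHECK(a.stride(1) == 1);      // row-major, contiguous innermost dim
TORCH_CHECK(a.stride(0) % 4 == 0);  // inner (k) dim must be a multiple of 4 for 2:4 sparsity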

Comment on lines 31 to 36
Epilogue functions can be defined to post-process the output before it is
written to GPU memory.
Epilogues must contain a public type named EVTCompute of type Sm90EVT,
as well as a static prepare_args function that constructs an
EVTCompute::Arguments struct.
Collaborator:

Since this comment is epilogue-specific and the epilogues are not defined in this file, I think this comment should be removed

Contributor:

Done

Comment on lines 535 to 557
def cutlass_compress_entry(a: torch.Tensor) \
-> Tuple[torch.Tensor, torch.Tensor]:
assert (a.dtype in [
torch.int8, torch.float8_e4m3fn, torch.bfloat16, torch.float16
])

# e.dtype: torch.uint8 so elemsPerElemE = 8b / 2b_per_nz = 4
elemsPerElemE = 4

m = a.shape[0]
k = a.shape[1]
a_compressed = torch.empty((m, k // 2), dtype=a.dtype, device=a.device)
e = torch.empty((m, k // 2 // elemsPerElemE),
dtype=torch.uint8,
device=a.device)

if not (torch.ops._C.cutlass_compress_entry(a_compressed, e, a)):
raise ValueError

return a_compressed, e


def cutlass_scaled_sparse_mm(
Collaborator:

Could you add high-level comments for what these are doing? In particular could you describe what e is?

Contributor:

Done
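
(A hedged sketch of the kind of high-level documentation requested, reflecting what this thread says about e; the docstring wording is illustrative, not the exact merged comment:)

from typing import Tuple
import torch

def cutlass_compress_entry(a: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Compress a 2:4-structured-sparse matrix into CUTLASS's packed format.

    Returns (a_compressed, e):
      - a_compressed: the retained non-zero values of `a`, shape (m, k // 2).
      - e: the sparsity metadata; each non-zero is described by a 2-bit
        position index, so one uint8 byte packs 4 entries and e has shape
        (m, k // 2 // 4).
    """
    assert a.dtype in (torch.int8, torch.float8_e4m3fn, torch.bfloat16,
                       torch.float16)
    m, k = a.shape
    elems_per_meta_byte = 4  # 8 bits per uint8 / 2 bits per non-zero
    a_compressed = torch.empty((m, k // 2), dtype=a.dtype, device=a.device)
    e = torch.empty((m, k // 2 // elems_per_meta_byte),
                    dtype=torch.uint8,
                    device=a.device)
    if not torch.ops._C.cutlass_compress_entry(a_compressed, e, a):
        raise ValueError("cutlass_compress_entry failed")
    return a_compressed, e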

@tlrmchlsmth (Collaborator) left a comment:

Looks good to me now, thanks for the hard work!

@LucasWilkinson (Contributor) left a comment:

LGTM too, just left a few very minor refactor/comment nits. Thanks for the hard work and iterations!

ops.def(
"cutlass_scaled_sparse_mm(Tensor! out, Tensor a,"
" Tensor b,"
" Tensor e, Tensor a_scales,"
@LucasWilkinson (Contributor) commented Dec 16, 2024:

nit: can you update argument naming to match, i.e. bt_nzs and bt_meta

Contributor:

Done
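
(For illustration, the renamed fragment would read roughly as below; this is a sketch that mirrors the quoted lines, not necessarily the exact merged registration:)

ops.def(
    "cutlass_scaled_sparse_mm(Tensor! out, Tensor a,"
    "                         Tensor bt_nzs,"
    "                         Tensor bt_meta, Tensor a_scales,"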

using ElementAB = typename Gemm::ElementAB;
using ElementD = typename Gemm::ElementD;

// Interface stride expected from the argument a (will get transposed)
Contributor:

nit: can you elaborate on this a bit, i.e. add something about the fact that we compute C^t = B^t @ A^t but we assume B is transposed before compressing hence the bt_<x> naming

Contributor:

Done
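
(A hedged sketch of the kind of comment being requested, based only on what this thread states; wording is illustrative:)

// We compute C^t = B^t @ A^t here. B is assumed to already be transposed
// before compression, hence the bt_nzs / bt_meta naming; this stride is the
// one expected for argument `a`, which gets transposed inside the kernel.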

auto layout_A = make_cute_layout<StrideA>(a, "A");
auto layout_D = make_cute_layout<StrideD>(out, "D");

auto stride_At = layout_A.stride();
Contributor:

nit: can you add a comment here explaining why At is the same stride as A for cutlass

Contributor:

Done
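
(A possible wording for that comment, offered only as a sketch; it states the usual row-major/column-major equivalence rather than quoting the merged code:)

// A row-major A and a column-major A^t describe the same bytes, so the
// stride extracted from layout_A can be reused directly as stride_At.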


using GemmKernel = typename Gemm::GemmKernel;
typename GemmKernel::ProblemShape prob_shape{
(int)bt_nzs.size(0), (int)size<0>(layout_A), (int)size<1>(layout_A), 1};
Contributor:

nit: we should avoid c-style casts for consistency (use static_cast here)

Contributor:

Done
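
(The same initialization with static_cast, as a sketch of the requested change:)

typename GemmKernel::ProblemShape prob_shape{
    static_cast<int>(bt_nzs.size(0)), static_cast<int>(size<0>(layout_A)),
    static_cast<int>(size<1>(layout_A)), 1};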


// CUTLASS sparse matrix compressor
ops.def(
"cutlass_sparse_compress_entry(Tensor! a_compressed, Tensor! e,"
Contributor:

nit: maybe update this to match the argument naming for cutlass_scaled_sparse_mm i.e. Tensor! a_nzs, Tensor! a_meta

Contributor:

Done
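
(For illustration, the renamed fragment would read roughly as below; a sketch mirroring the quoted lines, not necessarily the exact merged text:)

ops.def(
    "cutlass_sparse_compress_entry(Tensor! a_nzs, Tensor! a_meta,"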


/// Make A structured sparse by replacing elements with 0 and compress it
template <typename ElementA_, typename ElementAcc_>
bool cutlass_sparse_compress(torch::Tensor& a_compressed, torch::Tensor& e,
Contributor:

nit: maybe update this to match the argument naming for cutlass_scaled_sparse_mm i.e. Tensor! a_nzs, Tensor! a_meta

Contributor:

Done

@robertgshaw2-neuralmagic added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 16, 2024
* Helper function for checking CUTLASS errors
*/
#define CUTLASS_CHECK(status) \
{ \
Contributor:

Maybe extract status first (like below) so this macro can directly wrap expressions like function calls and not double-evaluate them?

Contributor:

Done
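
(For reference, a minimal sketch of the single-evaluation form being suggested; the merged macro may differ in detail:)

// Evaluate the status expression exactly once before checking it.
#define CUTLASS_CHECK(status)                                 \
  {                                                           \
    cutlass::Status error = status;                           \
    TORCH_CHECK(error == cutlass::Status::kSuccess,           \
                cutlassGetStatusString(error));               \
  }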

CMakeLists.txt Outdated
GIT_PROGRESS TRUE

# Speed up CUTLASS download by retrieving only the specified GIT_TAG instead of the history.
# Important: If GIT_SHALLOW is enabled then GIT_TAG works only with branch names and tags.
# So if the GIT_TAG above is updated to a commit hash, GIT_SHALLOW must be set to FALSE
GIT_SHALLOW TRUE
# GIT_SHALLOW FALSE
Member:

Should this be uncommented as FALSE now?

Suggested change
# GIT_SHALLOW FALSE
GIT_SHALLOW FALSE

Contributor:

Yeah sure. It's also the default I think but better be explicit as you said.

Comment on lines +105 to +106
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32)
Member:

future work: what about per-channel/per-token scales?

Contributor:

Yeah. We can also use that for benchmarking. I put this here only because it's similar to the dense benchmarking script.
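
(For future reference, a hedged sketch of what per-token/per-channel scales could look like in the benchmark; the (m, 1) and (1, n) shapes are an assumption here, not taken from this PR:)

# Assumed shapes: per-token scales for the activations, per-channel scales
# for the weights (these shapes are an assumption, not from this PR).
scale_a = torch.rand((m, 1), device="cuda", dtype=torch.float32)
scale_b = torch.rand((1, n), device="cuda", dtype=torch.float32)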


@classmethod
def get_min_capability(cls) -> int:
return 90
Member:

Worth leaving a note that this is due to cutlass 3.x kernel restrictions since we do have fp16+int8 support here

Contributor:

Done
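
(A sketch of the annotated version; the comment wording is illustrative:)

@classmethod
def get_min_capability(cls) -> int:
    # Hopper (SM90) is required by the CUTLASS 3.x sparse kernels, even
    # though fp16 + int8 is supported by this scheme.
    return 90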

@robertgshaw2-neuralmagic merged commit 60508ff into vllm-project:main on Dec 18, 2024
76 checks passed
SageMoore pushed a commit to neuralmagic/vllm that referenced this pull request Dec 19, 2024
…#10995)

Co-authored-by: Faraz Shahsavan <[email protected]>
Co-authored-by: ilmarkov <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Co-authored-by: [email protected] <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
ProExpertProg added a commit to neuralmagic/vllm that referenced this pull request Dec 20, 2024
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet
9 participants