[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support #10995

Merged: robertgshaw2-neuralmagic merged 102 commits into vllm-project:main from neuralmagic:dipika/rebased on Dec 18, 2024 (+2,365 −117)
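Conceptually, the quantized matmuls this PR wires up (e.g. `cutlass_scaled_sparse_mm`) compute an int8/fp8 GEMM with higher-precision accumulation, then rescale the result. The sketch below is not the actual CUTLASS kernel — it is a minimal NumPy reference for the int8 case, with made-up helper names (`quantize_int8`, `scaled_mm_int8`) and symmetric per-tensor scales assumed for illustration:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: returns (q, scale)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def scaled_mm_int8(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Reference scaled int8 matmul: accumulate in int32, then
    dequantize with the product of the two per-tensor scales."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # int32 accumulator
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8)).astype(np.float32)
b = rng.standard_normal((8, 3)).astype(np.float32)
ref = a @ b
approx = scaled_mm_int8(a, b)
print(np.max(np.abs(ref - approx)))  # small quantization error
```

The real kernel fuses the scale application into the epilogue and additionally requires the weight operand to be 2:4 sparse, but the numerics follow this pattern.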
Commits (102):

5d51361  Add cutlass 2:4 infrastructure (Faraz9877)
17f5b96  Update with test code (Faraz9877)
471a03c  Clean up a bit; both fp8 and int8 working (Faraz9877)
0b332fb  Add fp16 and bf16 support to sparse cutlass mm (Faraz9877)
da31648  semi_structured for fp16 and bf16 and int8 (ilmarkov)
e655f94  Fix A100 int8 tests (ilmarkov)
5fc3c1c  Add fp8 cusparseLt (ilmarkov)
9cf36d6  wip (ilmarkov)
ad09e79  Fix signatures (ilmarkov)
e75eabc  Fix compilation and tests (ilmarkov)
0306390  Update for older platforms (ilmarkov)
1021acb  Add benchmarks (ilmarkov)
19ce358  Fix typo (ilmarkov)
959408c  Added scaled_mm for fp8. (ilmarkov)
117b87b  Add docstrings (ilmarkov)
2c7e68e  Update for torch 2.5 (ilmarkov)
922f4f8  Add handling contiguous dense input for int8 and fp8 (ilmarkov)
beca038  Add fp8 cusparseLt (ilmarkov)
5d9cd25  Fix compilation and tests (ilmarkov)
39ad9d4  Add caching of cusparseLT meta (ilmarkov)
520eb62  Cached cusparseLt (ilmarkov)
20956e6  Fix destroy function (ilmarkov)
87c8088  Prepare for reproduce (ilmarkov)
4ea58b1  Fix cusparseLt caching (ilmarkov)
f0551ef  Make cached version default function (ilmarkov)
d7476e8  Fixes and polishing after rebase (ilmarkov)
681ea5e  add sparse 2:4 weight loading suport (dsikka)
ecf878f  Some more changes! (rahul-tuli)
80952dc  Cleanup (rahul-tuli)
8462c9d  get uncompressed to work; update gemm to use contiguous; use alex's u… (dsikka)
0a3e506  patch (dsikka)
2e28972  use our decompressor (dsikka)
28f0abb  Some more work (rahul-tuli)
c7a97a8  Use new scaled_T function (rahul-tuli)
ccadad0  Add multiprocessing for kernel sweep benchmarking (Faraz9877)
807737c  Add multi-GPU (Faraz9877)
04c19a5  Add cutlass_scaled_sparse_mm op (Faraz9877)
2a85c5a  Clean up (Faraz9877)
1b381c9  Update code (Faraz9877)
4e31076  Update code (Faraz9877)
13fccf4  Clean up the benchmarking (Faraz9877)
b345cc8  Clean up the cutlass benchmarking (Faraz9877)
2d03e1d  Fix cmake errors (Faraz9877)
e9439cc  Fix the cmake TAG (Faraz9877)
4ba7c0f  Merge branch 'buildable' into rahul-quant-merged-rs (robertgshaw2-neuralmagic)
f74ef37  update (robertgshaw2-neuralmagic)
f5bc9eb  fixed (robertgshaw2-neuralmagic)
1316076  updated (robertgshaw2-neuralmagic)
4d2b12c  updated (robertgshaw2-neuralmagic)
fe30b53  updated, calling things properly (robertgshaw2-neuralmagic)
4c61b19  running end to end but not passing (robertgshaw2-neuralmagic)
86716f8  updated (robertgshaw2-neuralmagic)
349d904  stash (robertgshaw2-neuralmagic)
b860f9e  update (robertgshaw2-neuralmagic)
df462b5  Fix batch size and zeros issue (Faraz9877)
14f6141  working e2e (robertgshaw2-neuralmagic)
abfd85d  Enable other datatypes (Faraz9877)
540d0ce  Add the heuristics for fp8 (Faraz9877)
b45c158  updated (robertgshaw2-neuralmagic)
5f7339e  Cleanup (rahul-tuli)
ed15777  Move model compressor to scheme (rahul-tuli)
34a84a4  Cleanup + updates for compressed_tensor changes (rahul-tuli)
4045bda  Add cherry-picked heuristic for Llama3 8B model (Faraz9877)
7c61ab0  updated with latest kernel (robertgshaw2-neuralmagic)
a8a1b57  Merge pull request #35 from neuralmagic/rob/semi-structured (dsikka)
4e060df  remove compressed support; validate against ct models for tp=1,24 (dsikka)
7a6d027  add testing cases (dsikka)
0987c98  add support for all cases; update tests (dsikka)
c7d1cc3  Merge pull request #37 from neuralmagic/dipika/sem-struc-uncomp (dsikka)
e5c6d10  Merge branch 'main' into dipika/rebased (dsikka)
299fe32  rebase fix? (dsikka)
fbbd469  Fix the w8a8 build errors (Faraz9877)
6dfc5c9  fix int8 so that it works; applying format' (dsikka)
4820ebe  support for sparse only ct models (dsikka)
2c32ce0  add 2:4 sparse only support, add test cases, add torch.comile workaround (dsikka)
a27ca81  add int8 quant tests (dsikka)
81c4360  Update code (Faraz9877)
6d574af  Update code, flip sparse op operand order to be consistent with dense (Faraz9877)
72f4577  Clean up code comments (Faraz9877)
208b2a0  Format vllm code (Faraz9877)
7020096  Update code (Faraz9877)
f222aae  Update code (Faraz9877)
c7a3a7d  Update code (Faraz9877)
94a4945  Update code (Faraz9877)
7fedb94  Update code (Faraz9877)
3d6c50a  Update code (Faraz9877)
8d94e1f  Update code (Faraz9877)
b039820  Update code (Faraz9877)
67aae3e  Clean up code (Faraz9877)
154814f  Update benchmarking code and remove empty files (Faraz9877)
ac059b4  Update code (Faraz9877)
b559b6a  Push activations and output transposes into CUTLASS code (Faraz9877)
4c927a0  Address reviews; one compression test left to pass (Faraz9877)
b177ab6  Fix the scale swap bug (Faraz9877)
8879323  Clean up benchmarking (Faraz9877)
18ba3de  Minimize includes and reformat the compressor file (Faraz9877)
c8f573b  Fix an indent (Faraz9877)
4916577  Fix compress_entry names and add some comments (Faraz9877)
c7b8a2c  Add ElementE for metadata type to spelling exceptions (Faraz9877)
0d38f0a  Created the entry + arch structure for the compressor and ignore 2to4… (Faraz9877)
c459bbc  Fix minor issues (Faraz9877)
2c15abc  Small fix (Faraz9877)
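For readers unfamiliar with the 2:4 ("semi-structured") sparsity pattern these kernels consume: in every group of 4 consecutive weights, at most 2 may be nonzero. The sketch below is a hedged illustration, not code from this PR — a hypothetical `prune_2_4` helper that produces the pattern by zeroing the two smallest-magnitude values per group:

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero out the 2 smallest-magnitude values in every group of 4,
    yielding the 2:4 semi-structured sparsity pattern."""
    w = weights.reshape(-1, 4).copy()
    # indices of the two smallest |values| in each group of four
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([[1.0, -3.0, 0.5, 2.0, 4.0, 0.1, -0.2, 1.5]])
pruned = prune_2_4(w)
print(pruned)  # groups become [0, -3, 0, 2] and [4, 0, 0, 1.5]
```

Hardware (NVIDIA sparse tensor cores) exploits this pattern by storing only the 2 kept values per group plus a small metadata index, which is what the compressor entries added in this PR prepare.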
```diff
@@ -361,7 +361,8 @@ def main(args: argparse.Namespace):
     # TODO(vllm-project/vllm/issues/9778): Count molti-modal token length.
     print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
           f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
-          f"{total_output_tokens / elapsed_time:.2f} output tokens/s")
+          f"{total_output_tokens / elapsed_time:.2f} output tokens/s, "
+          f"{total_num_tokens=} | {total_output_tokens=}")

     # Output JSON results if specified
     if args.output_json:
```

Review comment (on the added lines): This looks like debug cruft and should be reverted if so.
Reply: Done
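The line flagged in review uses Python 3.8+ "self-documenting" f-strings: `f"{name=}"` expands to `name=<value>`, which is why it reads as debug output. A minimal standalone sketch, with made-up token counts standing in for the benchmark's real values:

```python
# Made-up values; in the benchmark these come from the completed requests.
total_num_tokens = 2048
total_output_tokens = 512
elapsed_time = 4.0  # seconds

line = (f"Throughput: {total_num_tokens / elapsed_time:.2f} total tokens/s, "
        f"{total_output_tokens / elapsed_time:.2f} output tokens/s, "
        f"{total_num_tokens=} | {total_output_tokens=}")
print(line)
# Throughput: 512.00 total tokens/s, 128.00 output tokens/s, total_num_tokens=2048 | total_output_tokens=512
```

The trailing `{total_num_tokens=}` pair is handy while debugging but, as the reviewer notes, is out of place in a user-facing throughput summary.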
Review comment: Should this be uncommented as FALSE now?
Reply: Yeah, sure. It's also the default, I think, but it's better to be explicit as you said.