[Kernel][Hardware][AMD] Add support for GGUF quantization on ROCm #10254

kliuae · 2024-11-12T10:00:42Z

This PR adds support for running GGUF models on ROCm with vLLM.

github-actions · 2024-11-12T10:00:57Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Isotr0py · 2024-11-12T17:29:57Z

We have kernel tests for GGUF in tests/kernels/test_gguf.py. Can you enable them in run-amd-test.sh?

mgoin · 2024-11-12T20:19:04Z

csrc/quantization/gguf/ggml-common.h

@@ -1,7 +1,7 @@
 // copied from https://github.com/ggerganov/llama.cpp/blob/b2899/ggml-common.h
 #define QK_K 256
 #define K_QUANTS_PER_ITERATION 2
-#define WARP_SIZE 32
+#define WARP_SIZE_GGUF 32


Do you need to change this name? It seems the vast majority of the changes in this PR are due to this rename, so I would prefer to keep the original name.

The WARP_SIZE macro here conflicts with one defined in "cuda_compat.h", which provides cross-platform utilities that this port uses. For CUDA the macro redefinition may be just fine, because both definitions expand to the same value. For ROCm, however, some of the symbols in the GGUF quantization kernels need to take values different from those in "cuda_compat.h", while the others stay the same, in order to use full waves on wave64 devices. So I changed the macro name referenced in the quantization code for clarity, though it does produce quite a few mundane changes.

Another option would be to change the macro, perhaps through undefine/define, to the desired value at each place where it is used. That would avoid renaming anything, but the same symbol would then represent different values in different places. I'm fine with either option, and would be interested to know which one you prefer.

That's a clear justification, thank you for your thoughts. I prefer the GGUF name as it is more explicit, so if we must make changes anyway, let's keep what you have.

mgoin · 2024-11-12T20:20:06Z

vllm/_custom_ops.py

@@ -467,6 +442,32 @@ def machete_prepack_B_fake(b_q_weight: torch.Tensor,
        return torch.empty_like(b_q_weight,
                                memory_format=torch.contiguous_format)

+if hasattr(torch.ops._C, "ggml_dequantize"):


Leave a comment on why this can be conditional

Originally the meta kernel was added under another conditional check, albeit one that tested for a different op. I suppose the check is there so that, on platforms where these kernels are not built, loading this script does not run into problems. I'm not very familiar with torch's custom kernel registration, and would appreciate any pointers on the proper way of handling this.

I'm honestly not sure why, but I do know that this pattern has been used before, so it is a fair check to make.
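The reasoning in this thread can be illustrated with a minimal sketch. It does not use torch at all: `ops_without_gguf` and `ops_with_gguf` are hypothetical stand-ins for `torch.ops._C` on two different builds, and `register_fake_impls` and `registry` are illustrative names, not vLLM or PyTorch APIs. The only point is that a `hasattr` guard lets the same module load on builds where the GGUF kernels were never compiled.

```python
from types import SimpleNamespace

# Hypothetical stand-in for torch.ops._C on a build WITHOUT the GGUF kernels.
ops_without_gguf = SimpleNamespace()

# Hypothetical stand-in for a build WITH the GGUF kernels compiled in.
ops_with_gguf = SimpleNamespace(ggml_dequantize=lambda *args: None)


def register_fake_impls(ops_namespace, registry):
    """Register shape-only ("fake") implementations for ops that exist.

    `registry` is a plain dict used here in place of a real registration API.
    """
    if hasattr(ops_namespace, "ggml_dequantize"):

        def ggml_dequantize_fake(m, n):
            # A fake implementation only describes the output shape.
            return (m, n)

        registry["ggml_dequantize"] = ggml_dequantize_fake
    return registry


# On the build without the kernels nothing is registered and nothing fails.
registry_a = register_fake_impls(ops_without_gguf, {})
# On the build with the kernels the fake implementation is registered.
registry_b = register_fake_impls(ops_with_gguf, {})

print(sorted(registry_a))
print(sorted(registry_b))
```

Without the guard, the registration step would refer to an op that does not exist on the first build, which is the loading problem the reply above is guessing at.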

DarkLight1337 · 2024-11-21T08:56:14Z

Please merge from main to fix the CI failures.

kliuae added 7 commits August 30, 2024 06:11

initial port

71cedee

merge upstream

11e82a0

format

0ca2077

merge upstream

0d8caed

fix warp size 32

02dbde6

run mmvq kernels on 64 thread warps on rocm

03a9216

merge upstream

4904609

kliuae requested review from tlrmchlsmth and WoosukKwon as code owners November 12, 2024 10:00

mergify bot added the ci/build label Nov 12, 2024

mgoin reviewed Nov 12, 2024

kliuae added 2 commits November 14, 2024 08:24

enable gguf tests on amd

41c83dd

format

7ee595c

tjtanaa mentioned this pull request Nov 16, 2024

Roadmap EmbeddedLLM/vllm#4

Open

16 tasks

mgoin added the ready label Nov 20, 2024

mgoin approved these changes Nov 20, 2024

merge upstream

aaf6474

youkaichao merged commit 7c25fe4 into vllm-project:main Nov 23, 2024
70 of 73 checks passed

mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 28, 2024

[AMD] Add support for GGUF quantization on ROCm (vllm-project#10254)

9f1dbdc

Signed-off-by: Maxime Fournioux <[email protected]>

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024

[AMD] Add support for GGUF quantization on ROCm (vllm-project#10254)

dd38876
