
vulkan: implement initial support for IQ2 and IQ3 quantizations #11360

Open · wants to merge 11 commits into master
Conversation

@remyoudompheng commented Jan 22, 2025

This pull request implements basic support for the IQ2 and IQ3 quantizations in the Vulkan backend, with tentatively acceptable performance (there is probably room for improvement). Unfortunately I do not have access to coopmat2 hardware, so there may be typos in that part of the proposed implementation.

A commit modifies the Q3_K implementation to optimize performance, but it may be unwelcome in this PR.

The existing init_iq4nl_shmem function has been renamed to a more generic name in order to simplify ifdef logic.

Tests were performed on a Radeon 780M iGPU with Mesa 24.3.3 using the default compiler (ACO, not LLVM). It supports KHR_coopmat.

Performance results:

ggml_vulkan: 0 = AMD Radeon 780M (RADV GFX1103_R1) (radv) | uma: 1 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 3B IQ2_M - 2.7 bpw       |   1.06 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        479.81 ± 0.67 |
| qwen2 3B IQ2_M - 2.7 bpw       |   1.06 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         26.03 ± 0.08 |
| qwen2 3B IQ3_XS - 3.3 bpw      |   1.29 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        516.05 ± 0.81 |
| qwen2 3B IQ3_XS - 3.3 bpw      |   1.29 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         39.28 ± 0.12 |
| qwen2 3B IQ3_S mix - 3.66 bpw  |   1.38 GiB |     3.09 B | Vulkan     |  99 |         pp512 |       509.56 ± 14.44 |
| qwen2 3B IQ3_S mix - 3.66 bpw  |   1.38 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         35.83 ± 0.55 |
| qwen2 3B Q3_K - Small          |   1.35 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        502.33 ± 0.74 |
| qwen2 3B Q3_K - Small          |   1.35 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         43.28 ± 0.54 |
| qwen2 3B Q3_K - Medium         |   1.48 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        504.33 ± 0.63 |
| qwen2 3B Q3_K - Medium         |   1.48 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         40.19 ± 0.33 |
| qwen2 3B Q4_K - Medium         |   1.79 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        509.78 ± 1.15 |
| qwen2 3B Q4_K - Medium         |   1.79 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         34.43 ± 0.19 |

Performance numbers from test-backend-ops:

  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   268.20 us/run - 117.44 MFLOP/run - 437.89 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   541.34 us/run - 117.44 MFLOP/run - 216.94 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   431.30 us/run - 117.44 MFLOP/run - 272.29 GFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                3408 runs -   344.17 us/run - 117.44 MFLOP/run - 341.23 GFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 3408 runs -   365.16 us/run - 117.44 MFLOP/run - 321.62 GFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  2556 runs -   490.09 us/run - 117.44 MFLOP/run - 239.63 GFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                3408 runs -   330.88 us/run - 117.44 MFLOP/run - 354.93 GFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  2556 runs -   392.55 us/run - 117.44 MFLOP/run - 299.18 GFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 2556 runs -   427.37 us/run - 117.44 MFLOP/run - 274.80 GFLOPS

Before Q3_K change:

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   541.34 us/run - 117.44 MFLOP/run - 216.94 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   34 runs - 29597.18 us/run -  60.13 GFLOP/run -   2.03 TFLOPS

After Q3_K change:

  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   337.65 us/run - 117.44 MFLOP/run - 347.82 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   52 runs - 19365.63 us/run -  60.13 GFLOP/run -   3.10 TFLOPS

@github-actions bot added the testing, Vulkan, and ggml labels on Jan 22, 2025
@0cc4m self-requested a review on January 23, 2025
@0cc4m (Collaborator) commented Jan 23, 2025

Thank you, very cool! This will take a bit of time to review; I'll take a look this weekend. Can you fix the conflict? The mmq_wg_denoms fix has to be applied here too: #11343

@jeffbolznv (Collaborator)

Exciting to see this. I've done a quick check with coopmat2 and there are a few failures:

  MUL_MAT_ID(type_a=iq2_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 0.083714476 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=iq3_xxs,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 1.604342081 > 0.000500000 FAIL
  MUL_MAT_ID(type_a=iq3_s,type_b=f32,n_mats=4,n_used=2,b=0,m=512,n=32,k=256): [MUL_MAT_ID] NMSE = 0.533599553 > 0.000500000 FAIL

Not seeing failures with coopmat1 or no coopmat. I'll try to debug these later. I haven't looked at the code yet.

@jeffbolznv self-requested a review on January 23, 2025
@jeffbolznv (Collaborator)

I was surprised it was only the MUL_MAT_ID tests failing, but it was due to a gap in test coverage, which #11375 will fix. MUL_MAT also fails for coopmat2 with the same types.

@jeffbolznv (Collaborator)

I went ahead and did the straightforward unoptimized "port" of the failing dequant callbacks from mul_mm.comp - just divide the index by 2 (because mul_mm does pairs at a time) and replace data_a with the block reference. Code is at 078ebe5. Feel free to pull this in however you want. IMO it would be OK to have these be unoptimized at first, and you or I can optimize them later. I haven't done any perf testing yet.
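
For illustration, the shape of that port is roughly the following (a sketch with hypothetical names: dequant_pair stands in for the existing mul_mm.comp pair logic, and block_iq2_s for the block reference the cm2 callback receives):

    // Hypothetical sketch of the unoptimized port: the mul_mm.comp callback
    // dequantizes two values at a time, so a per-element cm2 callback halves
    // the index, runs the pair logic against the block reference, and picks
    // the requested half.
    float decode_one(const in block_iq2_s bl, const in uint idx) {
        const uint ib = idx / 2;                // mul_mm works on pairs of values
        const vec2 v = dequant_pair(bl, ib);    // hypothetical wrapper around the mul_mm logic
        return (idx & 1) == 0 ? v.x : v.y;
    }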

@remyoudompheng (Author)

Thanks for the comments. I rebased the branch to include #11343 and cherry-picked 078ebe5.

@jeffbolznv (Collaborator) left a review

I didn't review the actual dequantization logic, but I reviewed the rest. I still need to do some perf testing.

{
    // copy the table into shared memory and sync
    if (gl_LocalInvocationIndex.x < 32) {
        for (uint i = gl_LocalInvocationIndex.x; i < 512; i += 32) {
Collaborator: loop bound mismatches the array size

Collaborator: Also, the increment could use gl_WorkGroupSize.x instead of hardcoding 32, but it probably won't affect performance much in practice.

Author: The loop looks better and is less error-prone using the workgroup size; I made the fix.

uvec2(0x082b082b, 0x2b2b2b2b), uvec2(0x082b2b08, 0x2b2b2b2b), uvec2(0x2b082b08, 0x2b2b2b2b), uvec2(0x2b2b2b2b, 0x2b2b2b2b)
};

shared uvec2 iq2s_grid[1024];
Collaborator: We should probably account for these array sizes in ggml_vk_matmul_shmem_support. It's a bug that we didn't for iq4_nl, but these are significantly larger.
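
For scale (my arithmetic, not from the thread): shared uvec2 iq2s_grid[1024] alone is 1024 × 8 B = 8 KiB, and shared uint32_t iq3xxs_grid[512] is 2 KiB, versus a few dozen bytes for the 16-entry iq4_nl table, so these tables consume a meaningful fraction of the 32-64 KiB of shared memory typically available per workgroup.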

0x3e1c1c1c, 0x3e1c3404, 0x3e24140c, 0x3e24240c, 0x3e2c0404, 0x3e2c0414, 0x3e2c1424, 0x3e341c04,
};

shared uint32_t iq3xxs_grid[512];
Collaborator: array sizes don't match

Author: Fixed

@jeffbolznv (Collaborator)

I've been working on optimizing the cm2 dequant callbacks and am getting good speedups. I'm out of time to finish it tonight, but I'll share it in the morning.

@jeffbolznv (Collaborator)

Here are the cm2 optimizations: jeffbolznv@9079f06

RTX 4070 (including Q4_K for reference):

before:
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  616 runs -  1625.19 us/run -  60.13 GFLOP/run -  37.00 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               294 runs -  3402.81 us/run -  60.13 GFLOP/run -  17.67 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                308 runs -  3250.11 us/run -  60.13 GFLOP/run -  18.50 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 324 runs -  3099.17 us/run -  60.13 GFLOP/run -  19.40 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               378 runs -  2656.58 us/run -  60.13 GFLOP/run -  22.63 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 438 runs -  2291.55 us/run -  60.13 GFLOP/run -  26.24 TFLOPS  
  
after:
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  632 runs -  1583.83 us/run -  60.13 GFLOP/run -  37.96 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               596 runs -  1682.00 us/run -  60.13 GFLOP/run -  35.75 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                608 runs -  1647.64 us/run -  60.13 GFLOP/run -  36.49 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 490 runs -  2044.33 us/run -  60.13 GFLOP/run -  29.41 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):               682 runs -  1467.80 us/run -  60.13 GFLOP/run -  40.97 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                 672 runs -  1490.99 us/run -  60.13 GFLOP/run -  40.33 TFLOPS

@remyoudompheng (Author)

Branch updated:

- fix array lengths issues
- no cherry-pick of jeffbolznv@9079f06
- no change to shmem checks (they seem quite complex)

@remyoudompheng (Author)

For some reason commit 6ed3047 causes a huge performance regression on my device (Radeon 780M, with either the Mesa ACO or LLVM compiler), as if all benefits from the shared array were lost.

I restored the hardcoded 32 value in 3f7aa9d.

@jeffbolznv (Collaborator)

> fix array lengths issues

Looks good.

> no cherry-pick of jeffbolznv@9079f06

Do you want me to do this separately after you merge? Either way is OK with me.

> no change to shmem checks (they seem quite complex)

The tricky part is that we don't currently track which sizes are supported per-type, and now the shared memory usage depends on the type (well, it did previously for iq4_nl, but we got away with it). Maybe the easiest thing to do is just make bool mul_mat_l and friends into arrays and do the computation for all types. A simpler and less precise way might just be to say that these new types require 48KB of shared memory or more.

@0cc4m (Collaborator) left a review

I don't see any issues on AMD, Intel or Nvidia in my tests. Performance isn't that good yet, but that can be improved separately.

void init_iq_shmem()
{
    // copy the table into shared memory and sync
    if (gl_LocalInvocationIndex.x < 32) {
Collaborator: For the shaders that have a variable workgroup size, this will cause issues on devices with a subgroup size smaller than 32. You'll need to do something like what I did in #10809 to fix this.

I have a feeling that this might be what's causing the llvmpipe CI failures as well.
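
Concretely (my illustration, not from the thread): with a 16-wide workgroup, the loop for (uint i = gl_LocalInvocationIndex.x; i < 512; i += 32) only has invocations 0-15, so it copies indices 0-15, 32-47, 64-79, and so on, and never fills 16-31 or 48-63, leaving half the table uninitialized.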

Collaborator: It's too bad the version using gl_WorkGroupSize didn't work, because it wouldn't have had that issue. I wonder if writing the loop like this would avoid the perf regression:

    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += gl_WorkGroupSize.x) {
        iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
    }

This should fully unroll and will work without additional branches as long as the workgroup size evenly divides the array size.

@netrunnereve (Collaborator)

Well you beat me to this 😉

As @0cc4m mentioned, I think speed is not really a priority here; the important part is to have a functional implementation we can improve on in the future. This will probably need dedicated mat-vec shaders, like the K-quants have, to get full performance.

Please see my review, but aside from that your code looks fine (I didn't verify the actual dequantization algorithm though) and runs fine on my AMD GCN cards. If you want to be sure your implementation is correct, it's worth running a perplexity check against a different backend to see if the numbers match up.
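
For example (a sketch; model and dataset paths are placeholders), comparing Vulkan against a CPU run:

    $ ./llama-perplexity -m qwen2-3b-iq2_m.gguf -f wiki.test.raw -ngl 99   # Vulkan
    $ ./llama-perplexity -m qwen2-3b-iq2_m.gguf -f wiki.test.raw -ngl 0    # CPU reference

The final PPL values should agree to within a small tolerance if the dequantization is correct.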

@remyoudompheng (Author) commented Jan 26, 2025

The bug regarding gl_WorkGroupSize happens at the glslc step: the SPIR-V itself looks wrong. Even with [[unroll]], the loop is compiled with a stride of 1, making each shader invocation very costly.

$ glslc  -DDATA_A_IQ2_XS=1 -DB_TYPE=float -DB_TYPE_VEC2=vec2 -DB_TYPE_VEC4=vec4 -DD_TYPE=float mul_mat_vec.comp -DFLOAT_TYPE=float -o mul_mat_vec_iq2xs.spv
$ spirv-dis mul_mat_vec_iq2xs.spv
...
               ; Function init_iq_shmem_
%init_iq_shmem_ = OpFunction %void None %7

         %11 = OpLabel
        %i_0 =   OpVariable %_ptr_Function_uint Function
  %indexable =   OpVariable %_ptr_Function__arr_v2uint_uint_512 Function
                 OpLine %4 553 0
                 OpStore %i_0 %uint_0
                 OpBranch %65

         %65 = OpLabel
                 OpLine %4 553 0
                 OpLoopMerge %67 %68 None
                 OpBranch %69

         %69 =     OpLabel
                     OpLine %4 553 0
         %70 =       OpLoad %uint %i_0
         %72 =       OpULessThan %bool %70 %uint_512
                     OpBranchConditional %72 %66 %67

         %66 =         OpLabel
                         OpLine %4 554 0
         %77 =           OpLoad %uint %i_0
         %80 =           OpLoad %uint %gl_LocalInvocationIndex
         %81 =           OpIAdd %uint %77 %80
        %676 =           OpLoad %uint %i_0
        %677 =           OpLoad %uint %gl_LocalInvocationIndex
        %678 =           OpIAdd %uint %676 %677
                         OpStore %indexable %675
        %682 =           OpAccessChain %_ptr_Function_v2uint %indexable %678
        %683 =           OpLoad %v2uint %682
        %685 =           OpAccessChain %_ptr_Workgroup_v2uint %iq2xs_grid %81
                         OpStore %685 %683
                         OpBranch %68

         %68 =   OpLabel
                   OpLine %4 553 0
        %687 =     OpLoad %uint %i_0
        %688 =     OpIAdd %uint %687 %uint_1
                   OpStore %i_0 %688
                   OpBranch %65

(notice the %uint_1 in instruction %688; it is %uint_32 when 32 is hardcoded)

It seems to be this issue: KhronosGroup/glslang#2479 (more precisely KhronosGroup/glslang#2627)

The workaround is to include types.comp after the workgroup size is declared in the shader. If this is fine with you, I can update the PR to use this method (meaning minor changes to move the include in all shaders, or at least the ones calling init_iq_shmem).
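
For reference, the workaround amounts to an ordering change like this at the top of each affected shader (a sketch; the actual layout qualifiers differ per shader):

    #version 450
    // declare the workgroup size first, so gl_WorkGroupSize is already a
    // compile-time constant when the table-copy loop in types.comp is compiled
    layout(local_size_x = 32, local_size_y = 1, local_size_z = 1) in;
    #include "types.comp"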

@remyoudompheng (Author)

PR updated:

- llvmpipe is now happy

@jeffbolznv (Collaborator)

> The workaround is to include types.comp after the workgroup size is declared in the shader.

I'm worried that somebody will accidentally break this in the future and it'll be very confusing. Another option might be to pass the workgroup size as a function parameter.
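
A minimal sketch of that option, assuming the helper keeps the existing *_grid_const source arrays and is called as init_iq_shmem(gl_WorkGroupSize) from each shader's main:

    void init_iq_shmem(uvec3 wgsize)
    {
        // copy the table into shared memory and sync; the size comes in as a
        // parameter, so the loop no longer depends on where types.comp is included
        for (uint i = gl_LocalInvocationIndex.x; i < iq2xxs_grid.length(); i += wgsize.x) {
            iq2xxs_grid[i] = iq2xxs_grid_const[i];
        }
        barrier();
    }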

@0cc4m (Collaborator) commented Jan 26, 2025

I think a function parameter is a good idea, yeah.

@sorasoras

Looking forward to IQ4_XS support as well.

@remyoudompheng (Author) commented Jan 26, 2025

Indeed, it looks better with a function parameter; the branch is updated.

@sorasoras feel free to test branch remyoudompheng@e955cbed

@sorasoras

> Indeed, it looks better with a function parameter; the branch is updated.
>
> @sorasoras feel free to test branch remyoudompheng@e955cbed

 .\llama-bench.exe -m W:\model\sakura-14b-qwen2beta-v0.9-IQ4_XS.gguf -ngl 99 -sm none
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 7900 XTX (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |    sm |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ------------: | -------------------: |
| qwen2 13B IQ4_XS - 4.25 bpw    |   7.37 GiB |    14.17 B | Vulkan     |  99 |  none |         pp512 |       1418.05 ± 2.73 |
| qwen2 13B IQ4_XS - 4.25 bpw    |   7.37 GiB |    14.17 B | Vulkan     |  99 |  none |         tg128 |         50.25 ± 0.29 |

build: e955cbed (4570)

It looks OK.
