vulkan: implement initial support for IQ2 and IQ3 quantizations #11360
base: master
Conversation
Thank you, very cool! This will take a bit of time to review, I'll take a look this weekend. Can you fix the conflict? The mmq_wg_denoms fix has to be applied here too: #11343
Exciting to see this. I've done a quick check with coopmat2 and there are a few failures. Not seeing failures with coopmat1 or no coopmat. I'll try to debug these later. I haven't looked at the code yet.
I was surprised it was only the MUL_MAT_ID tests failing, but it was due to a gap in test coverage, which #11375 will fix. MUL_MAT also fails for coopmat2 with the same types.
I went ahead and did the straightforward unoptimized "port" of the failing dequant callbacks from mul_mm.comp - just divide the index by 2 (because mul_mm does pairs at a time) and replace data_a with the block reference. Code is at 078ebe5. Feel free to pull this in however you want. IMO it would be OK to have these be unoptimized at first and you or I can optimize them later. I haven't done any perf testing yet.
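For illustration, a rough sketch of that transformation; `idx` and `bl` here are stand-ins for the callback's actual index and block-reference parameters, not the real code:

```glsl
// mul_mm.comp dequantizes two values per iteration, so its index counts pairs.
// A dequant callback produces one value per call, so halve the index and read
// through the block reference instead of the global buffer:
const uint ib = idx / 2;   // divide by 2: mul_mm worked on pairs at a time
// was:  data_a[ib].qs[...]
// now:  bl.block.qs[...]  // bl: the block reference passed to the callback
```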
Force-pushed from a56c535 to ecfabd7.
I didn't review the actual dequantization logic, but I reviewed the rest. I still need to do some perf testing.
```glsl
{
    // copy the table into shared memory and sync
    if (gl_LocalInvocationIndex.x < 32) {
        for (uint i = gl_LocalInvocationIndex.x; i < 512; i += 32) {
```
The loop bound doesn't match the array size.
Also, the increment could use gl_WorkGroupSize.x instead of hardcoding 32, but it probably won't affect performance much in practice.
The loop looks better using the workgroup size and is less error-prone; I made the fix.
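For reference, a sketch of what the fixed loop could look like (`iq2s_grid_const` is an assumed name for the constant source table, by analogy with `iq2xxs_grid_const` later in the thread):

```glsl
// Stride by the actual workgroup size instead of a hardcoded 32, and bound
// the loop by the shared array's declared length:
for (uint i = gl_LocalInvocationIndex.x; i < iq2s_grid.length(); i += gl_WorkGroupSize.x) {
    iq2s_grid[i] = iq2s_grid_const[i];
}
barrier(); // sync so all invocations see the table before reading it
```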
```glsl
    uvec2(0x082b082b, 0x2b2b2b2b), uvec2(0x082b2b08, 0x2b2b2b2b), uvec2(0x2b082b08, 0x2b2b2b2b), uvec2(0x2b2b2b2b, 0x2b2b2b2b)
};

shared uvec2 iq2s_grid[1024];
```
We should probably account for these array sizes in ggml_vk_matmul_shmem_support. It's a bug that we didn't for iq4_nl, but these are significantly larger.
```glsl
    0x3e1c1c1c, 0x3e1c3404, 0x3e24140c, 0x3e24240c, 0x3e2c0404, 0x3e2c0414, 0x3e2c1424, 0x3e341c04,
};

shared uint32_t iq3xxs_grid[512];
```
The array sizes don't match.
Fixed
I've been working on optimizing the cm2 dequant callbacks and am getting good speedups. I'm out of time to finish it tonight, but I'll share it in the morning.
Here are the cm2 optimizations: jeffbolznv@9079f06. RTX 4070 numbers (including Q4_K for reference):
Force-pushed from 890a125 to 6ed3047.
Branch updated.
Looks good.
Do you want me to do this separately after you merge? Either way is OK with me.
The tricky part is that we don't currently track which sizes are supported per-type, and now the shared memory usage depends on the type (well, it did previously for iq4_nl, but we got away with it). Maybe the easiest thing to do is just make …
I don't see any issues on AMD, Intel or Nvidia in my tests. Performance isn't that good yet, but that can be improved separately.
```glsl
void init_iq_shmem()
{
    // copy the table into shared memory and sync
    if (gl_LocalInvocationIndex.x < 32) {
```
For the shaders that have a variable workgroup size, this will cause issues with devices that have a subgroup size smaller than 32. You'll need to do something like what I did in #10809 to fix this.
I have a feeling that this might be what's causing the llvmpipe CI failures as well.
It's too bad the version using gl_WorkGroupSize didn't work, because it wouldn't have had that issue. I wonder if writing the loop like this would avoid the perf regression:
```glsl
[[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += gl_WorkGroupSize.x) {
    iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
}
```
This should fully unroll and will work without additional branches as long as the workgroup size evenly divides the array size. For example, with a 512-entry table and a workgroup size of 128, it unrolls into four unconditional copies.
Well, you beat me to this 😉 As @0cc4m mentioned, I think speed is not really a priority here; the important part is to have a functional implementation we can improve on in the future. This will probably need dedicated mat-vec shaders like the K-quants to get full performance. Please see my review, but aside from that your code looks fine (I didn't verify the actual dequantization algorithm, though) and it runs fine on my AMD GCN cards. If you want to be sure that your implementation is correct, it's worth running a perplexity check against a different backend to see if the numbers match up.
The bug regarding gl_WorkGroupSize seems to be this issue: KhronosGroup/glslang#2479 (more precisely KhronosGroup/glslang#2627). The workaround is to include …
Force-pushed from 3f7aa9d to edd2546.
I'm worried that somebody will accidentally break this in the future and it'll be very confusing. Another option might be to pass the workgroup size as a function parameter.
I think a function parameter is a good idea, yeah.
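A minimal sketch of the function-parameter approach (the signature and table names are assumptions, not necessarily the final code):

```glsl
// The caller passes its compile-time-constant workgroup size explicitly, so
// the copy loop cannot silently fall out of sync with the launch dimensions.
void init_iq_shmem(uvec3 wgsize)
{
    // copy the table into shared memory and sync; assumes wgsize.x evenly
    // divides the table size (see the unrolling discussion above)
    [[unroll]] for (uint i = 0; i < iq2xxs_grid.length(); i += wgsize.x) {
        iq2xxs_grid[i + gl_LocalInvocationIndex.x] = iq2xxs_grid_const[i + gl_LocalInvocationIndex.x];
    }
    barrier();
}

// typical call site in a shader:
//     init_iq_shmem(gl_WorkGroupSize);
```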
Looking forward to IQ4_XS support as well.
Force-pushed from edd2546 to 0886297.
Indeed, it looks better with a function parameter; the branch is updated. @sorasoras, feel free to test branch remyoudompheng@e955cbed
It looks OK.
This pull request implements basic support for IQ2 and IQ3 quantizations in the Vulkan backend, with tentatively acceptable performance (there is likely room for improvement). Unfortunately I do not have access to coopmat2 hardware, so there might be typos in the proposed implementation.
A commit modifies the `Q3_K` implementation to optimize performance, but it may be unwelcome in this PR. The existing `init_iq4nl_shmem` function has been renamed to a more generic name in order to simplify `ifdef` logic. Tests were performed on a Radeon 780M iGPU with Mesa 24.3.3 using the default compiler (ACO, not LLVM); it supports `KHR_coopmat`.
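As a hypothetical illustration of the simplified `ifdef` logic, with one generic init function shared by all table-based quants (the macro names here are assumptions):

```glsl
// One generic entry point instead of a per-type init_iq4nl_shmem-style call:
#if defined(DATA_A_IQ2_XXS) || defined(DATA_A_IQ2_XS) || defined(DATA_A_IQ2_S) || \
    defined(DATA_A_IQ3_XXS) || defined(DATA_A_IQ3_S) || defined(DATA_A_IQ4_NL)
    init_iq_shmem();
#endif
```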
Performance results from `test-backend-ops`:

Before `Q3_K` change:

After `Q3_K` change: