ggml-cpu: support IQ4_NL_4_4 by runtime repack #10541
Conversation
Performance Evaluation

It shows about a ~3x speedup for IQ4_NL. Tested on Mac M2. The previous PR #10196 contains more evaluation results.

This PR: build f56013d (4193)
Master: build 4a57d36 (4192)
The implementation looks good. I see a ~2x pp speedup on M3 Max and it doesn't seem to affect the load time too badly.
Copying my comment from #10196 here:

I'm afraid our current runtime dispatch mechanism doesn't actually work, and no one really tests it. The original ASM version also needs the dotprod feature, but it doesn't check for it...

Yes, I agree. I was aware that this is an issue on x86. The goal for x86 is to bundle multiple versions of the CPU backend for the different instruction sets as dynamic libraries and load the best one at startup. We should probably do the same for ARM; in addition to what you mention, it is incomplete, and not every function that uses the features checks them at runtime.
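(For illustration only: a minimal sketch of what a runtime dotprod check could look like on Linux/AArch64, using getauxval. The exact mechanism and platforms ggml covers may differ; this is not code from this PR.)

```c
#include <stdbool.h>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ASIMDDP
#endif

// Returns true if the running CPU reports the Armv8.2 dot-product (SDOT/UDOT) feature.
// Illustrative sketch only; other OSes need their own checks (sysctl, IsProcessorFeaturePresent, ...).
static bool cpu_has_dotprod(void) {
#if defined(__aarch64__) && defined(__linux__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
#else
    return false;
#endif
}
```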
Nice work. Here are the results on M2 Ultra for 1.5B, 3B and 7B IQ4_NL:
./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-iq4_nl.gguf -m ../models/qwen2.5-3b-coder/ggml-model-iq4_nl.gguf -m ../models/qwen2.5-7b-coder/ggml-model-iq4_nl.gguf -t 8,16 -p 1,2,4,8,16,256 -n 64 -fa 1
model | size | backend | threads | fa | test | t/s (master) | t/s (PR) | speedup |
---|---|---|---|---|---|---|---|---|
qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp1 | 104.78 ± 4.57 | 149.73 ± 1.68 | 1.43 |
qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp2 | 119.53 ± 0.78 | 187.11 ± 11.80 | 1.57 |
qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp4 | 140.03 ± 1.08 | 349.12 ± 3.08 | 2.49 |
qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp8 | 154.59 ± 1.64 | 416.07 ± 2.50 | 2.69 |
qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp16 | 163.40 ± 0.37 | 460.18 ± 8.88 | 2.82 |
qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp256 | 168.80 ± 0.23 | 505.59 ± 0.73 | 3.00 |
qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | tg64 | 103.05 ± 0.13 | 146.19 ± 1.22 | 1.42 |
qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp1 | 122.05 ± 35.35 | 139.92 ± 34.14 | 1.15 |
qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp2 | 175.91 ± 0.17 | 233.74 ± 4.39 | 1.33 |
qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp4 | 226.10 ± 1.81 | 468.70 ± 2.66 | 2.07 |
qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp8 | 267.26 ± 1.15 | 603.36 ± 49.28 | 2.26 |
qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp16 | 289.27 ± 0.84 | 771.76 ± 6.92 | 2.67 |
qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp256 | 332.53 ± 0.27 | 946.58 ± 4.26 | 2.85 |
qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | tg64 | 138.06 ± 1.81 | 147.33 ± 7.96 | 1.07 |
qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp1 | 55.94 ± 1.54 | 86.27 ± 0.30 | 1.54 |
qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp2 | 61.67 ± 0.23 | 105.29 ± 0.15 | 1.71 |
qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp4 | 70.16 ± 0.20 | 187.99 ± 2.03 | 2.68 |
qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp8 | 75.40 ± 0.21 | 213.79 ± 0.64 | 2.84 |
qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp16 | 76.98 ± 0.13 | 230.31 ± 1.53 | 2.99 |
qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp256 | 81.41 ± 0.44 | 249.87 ± 3.24 | 3.07 |
qwen2 3B | 1.70 GiB | CPU | 8 | 1 | tg64 | 56.24 ± 0.05 | 85.54 ± 0.23 | 1.52 |
qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp1 | 81.54 ± 0.12 | 87.91 ± 9.37 | 1.08 |
qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp2 | 98.61 ± 1.32 | 132.32 ± 0.96 | 1.34 |
qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp4 | 121.34 ± 0.45 | 260.49 ± 5.29 | 2.15 |
qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp8 | 139.94 ± 0.34 | 349.79 ± 2.05 | 2.50 |
qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp16 | 139.54 ± 0.39 | 415.31 ± 0.82 | 2.98 |
qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp256 | 159.64 ± 0.46 | 469.05 ± 15.55 | 2.94 |
qwen2 3B | 1.70 GiB | CPU | 16 | 1 | tg64 | 77.62 ± 1.99 | 89.08 ± 0.60 | 1.15 |
qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp1 | 27.25 ± 1.46 | 45.50 ± 0.13 | 1.67 |
qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp2 | 30.18 ± 0.04 | 50.24 ± 0.20 | 1.66 |
qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp4 | 33.20 ± 1.18 | 94.53 ± 0.15 | 2.85 |
qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp8 | 34.78 ± 0.12 | 103.33 ± 0.13 | 2.97 |
qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp16 | 35.32 ± 0.03 | 106.36 ± 0.47 | 3.01 |
qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp256 | 35.79 ± 0.13 | 113.61 ± 1.46 | 3.17 |
qwen2 7B | 4.15 GiB | CPU | 8 | 1 | tg64 | 28.31 ± 0.03 | 44.94 ± 0.16 | 1.59 |
qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp1 | 45.23 ± 0.16 | 48.33 ± 0.29 | 1.07 |
qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp2 | 53.55 ± 0.05 | 64.34 ± 4.75 | 1.20 |
qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp4 | 61.04 ± 0.19 | 156.32 ± 1.45 | 2.56 |
qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp8 | 63.67 ± 2.90 | 175.07 ± 1.82 | 2.75 |
qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp16 | 66.32 ± 0.57 | 190.85 ± 5.79 | 2.88 |
qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp256 | 70.31 ± 0.03 | 212.75 ± 1.93 | 3.03 |
qwen2 7B | 4.15 GiB | CPU | 16 | 1 | tg64 | 44.30 ± 0.44 | 47.26 ± 1.78 | 1.07 |
Perplexity for a few chunks seems OK:
./bin/llama-perplexity -m ../models/llama-3.1-8b/ggml-model-iq4_nl.gguf -f ../build/wikitext-2-raw/wiki.test.raw --chunks 4
float * res_ptr = s;

for (int x = 0; x < nc / ncols_interleaved; x++) {
    const block_q4_0x4 * b_ptr = (const block_q4_0x4 *) vx + (x * nb);
Shouldn't this be a block_iq4_nlx4, not a block_q4_0x4?
You are right, this is a typo. Since the two structs happen to have the same layout, it's not a big problem. I'll open a new PR to correct it.
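(Side note, purely illustrative: the reason an identical layout makes the wrong type name harmless is that the cast reads the same bytes either way. The struct fields and sizes below are made up for demonstration and are not copied from ggml's headers.)

```c
#include <stdint.h>

// Two hypothetical interleaved-block structs with identical members in the
// same order: a pointer cast between them is layout-compatible, so the kernel
// still computes the right values; the type name is just misleading.
typedef struct { uint16_t d[4]; uint8_t qs[64]; } demo_block_q4_0x4;
typedef struct { uint16_t d[4]; uint8_t qs[64]; } demo_block_iq4_nlx4;

_Static_assert(sizeof(demo_block_q4_0x4) == sizeof(demo_block_iq4_nlx4),
               "identical layout, so the cast is benign in practice");
```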
for (int y = 0; y < nr / 4; y++) {
    const block_q8_0x4 * a_ptr = (const block_q8_0x4 *) vy + (y * nb);
    for (int x = 0; x < nc / ncols_interleaved; x++) {
        const block_q4_0x4 * b_ptr = (const block_q4_0x4 *) vx + (x * nb);
Shouldn't this be a block_iq4_nlx4, not a block_q4_0x4?
Supersedes #10196.
Here I implement IQ4_NL runtime repack for the CPU backend. Currently only IQ4_NL_4_4 for Arm NEON, implemented with intrinsics. If you are curious how these intrinsics were derived (and about the many potential optimization opportunities), please see #10196 for more information; there is a lengthy comparison between the intrinsics version and the original asm version.
I only support runtime repack and do not support llama-quantize, since based on the discussion in #10196, online repack is the preferred flow. Online repack for IQ4_NL is significantly slower than for Q4_0, but I haven't done any rigorous measurements. Static quantization support could be added later if anyone really needs it.
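(To sketch the general idea of runtime repack, not the actual ggml code: at load time, corresponding blocks from 4 consecutive rows are interleaved into one x4 block, so the NEON kernel can then process 4 rows per inner loop. The block size, struct fields, and helper name below are assumptions for illustration.)

```c
#include <stdint.h>
#include <string.h>

#define DEMO_QK 32  // assumed number of elements per quantized block

// Hypothetical single-row and 4-row-interleaved block layouts (illustration only).
typedef struct { uint16_t d;    uint8_t qs[DEMO_QK / 2];     } demo_block;
typedef struct { uint16_t d[4]; uint8_t qs[DEMO_QK / 2 * 4]; } demo_block_x4;

// Repack nb blocks from each of 4 consecutive rows into nb interleaved x4 blocks.
// src holds 4 * nb single-row blocks, laid out row after row.
static void demo_repack_x4(demo_block_x4 * dst, const demo_block * src, int nb) {
    for (int b = 0; b < nb; b++) {
        for (int r = 0; r < 4; r++) {
            const demo_block * s = &src[r * nb + b];
            dst[b].d[r] = s->d;
            // A real kernel interleaves the quants at a finer granularity so the
            // SIMD loads line up; a per-row copy keeps this sketch simple.
            memcpy(&dst[b].qs[r * (DEMO_QK / 2)], s->qs, DEMO_QK / 2);
        }
    }
}
```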