
ggml-cpu: support IQ4_NL_4_4 by runtime repack #10541

Merged (2 commits, Nov 28, 2024)

Conversation

FanShupei (Contributor)

Supersedes #10196.

Here I implement IQ4_NL runtime repack for the CPU backend. Currently it covers only IQ4_NL_4_4 for Arm NEON, implemented with intrinsics. If you are curious about where these intrinsics come from (and about many potential optimization opportunities), please see #10196 for more information; it contains a lengthy comparison between the intrinsics version and the original asm version.

I only support runtime repack and do not support llama-quantize, since, based on the discussion in #10196, online repack is the preferred flow. Online repack for IQ4_NL is significantly slower than for Q4_0, but I haven't done any rigorous measurements. Static quantization support could be added later if anyone really needs it.
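For context, "runtime repack" here means converting plain IQ4_NL blocks into an interleaved x4 layout at model load time, so that a 4x4 kernel can process four rows of the weight matrix at once with contiguous vector loads. Below is a minimal sketch of that interleaving idea; the struct definitions, field names, and the 4-byte group size are simplified assumptions for illustration, not the actual ggml code.

```c
#include <stdint.h>
#include <string.h>

#define QK4_NL 32

// A plain IQ4_NL block: one fp16 scale (stored as raw bits here) plus
// 32 packed 4-bit indices into the IQ4_NL non-linear value table.
typedef struct {
    uint16_t d;
    uint8_t  qs[QK4_NL / 2];
} block_iq4_nl_sketch;

// Four blocks from four consecutive rows, with the quant bytes interleaved
// so that one vector load fetches data for all four rows.
typedef struct {
    uint16_t d[4];
    uint8_t  qs[QK4_NL * 2];
} block_iq4_nlx4_sketch;

// Interleave in 4-byte groups: output group g takes 4 bytes from
// row (g % 4) at offset 4 * (g / 4).
static void repack_iq4_nl_x4(block_iq4_nlx4_sketch * dst, const block_iq4_nl_sketch src[4]) {
    for (int r = 0; r < 4; r++) {
        dst->d[r] = src[r].d;
    }
    for (int g = 0; g < (QK4_NL * 2) / 4; g++) {
        memcpy(dst->qs + 4 * g, src[g % 4].qs + 4 * (g / 4), 4);
    }
}
```

Doing this once at load time trades a small amount of extra work during loading for much better data layout in the matmul inner loop, which is where the speedups reported below come from.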

@FanShupei (Contributor, Author)

Performance Evaluation

It shows roughly a 3x speedup for IQ4_NL. Tested on a Mac M2 with GGML_METAL=off.

The previous PR (#10196) contains more evaluation results.

This PR

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 98.49 ± 0.82 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 97.96 ± 0.83 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 95.77 ± 0.10 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 34.92 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 186.77 ± 0.52 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 186.40 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 181.26 ± 0.08 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 61.11 ± 0.04 |

build: f56013d (4193)

Master

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 31.41 ± 0.16 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 31.51 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 31.12 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 21.13 ± 0.01 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 61.33 ± 0.03 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 61.24 ± 0.00 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 60.58 ± 0.20 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 41.13 ± 0.03 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 127.61 ± 0.87 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 127.35 ± 0.11 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 122.99 ± 0.05 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 37.04 ± 0.01 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 247.89 ± 0.81 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 247.35 ± 0.15 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 238.69 ± 0.24 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 65.17 ± 0.04 |

build: 4a57d36 (4192)

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 27, 2024.
@slaren (Collaborator) left a comment:

The implementation looks good. I see a ~2x pp speedup on M3 Max and it doesn't seem to affect the load time too badly.

@FanShupei (Contributor, Author)

Copying my comment from #10196 here:

I worry it won't work as expected if we switch to intrinsics. If the features are not enabled at compile time, the intrinsics won't compile. If they are enabled at compile time, the compiler may introduce SIMD instructions into the base implementation through auto-vectorization. This is why the CI fails, but I currently have no idea how to fix it.

I'm afraid our current runtime dispatch mechanism doesn't actually work and no one really tests it. The original asm version also needs the dotprod feature, but it doesn't check for it...
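To make the failure mode concrete, here is a sketch of the usual gating pattern for a dot-product kernel, written as plain standalone C rather than the actual ggml code. The intrinsic path only exists when dotprod is enabled at compile time, and once it is enabled, the compiler is also free to auto-vectorize the scalar fallback with the same instructions, which is exactly the problem described above.

```c
// Illustrative sketch only; assumes n is a multiple of 16.
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
static int cpu_has_dotprod(void) {
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
}
#else
static int cpu_has_dotprod(void) {
    return 0; // placeholder: other platforms need their own detection
}
#endif

#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
// This path only exists if dotprod was enabled at compile time.
static int32_t dot_i8_dotprod(const int8_t * a, const int8_t * b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    return vaddvq_s32(acc);
}
#endif

static int32_t dot_i8_scalar(const int8_t * a, const int8_t * b, int n) {
    // Caveat: with -march=...+dotprod the compiler may auto-vectorize this
    // loop using the very instructions the runtime check is meant to guard.
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t) a[i] * b[i];
    }
    return acc;
}

int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
#if defined(__ARM_FEATURE_DOTPROD)
    if (cpu_has_dotprod()) {
        return dot_i8_dotprod(a, b, n);
    }
#endif
    return dot_i8_scalar(a, b, n);
}
```

In other words, a runtime check alone is not enough: the build also has to produce a genuinely feature-free baseline, which is what the multi-variant backend idea below is meant to address.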

@slaren (Collaborator) commented Nov 27, 2024:

Yes, I agree; I was aware that this is an issue on x86. The goal for x86 is to bundle multiple versions of the CPU backend for the different instruction sets as dynamic libraries and load the best one at startup. We should probably do the same for ARM; in addition to what you mention, the current mechanism is incomplete and not every function that uses the features checks for them at runtime.
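As a rough illustration of that plan, here is a sketch of a startup loader that picks the most capable CPU backend variant. The library names and the exported `ggml_backend_score()` symbol are assumptions for illustration only, not the actual ggml loading mechanism.

```c
// Sketch of "load the best CPU backend variant at startup" (POSIX dlopen).
#include <dlfcn.h>
#include <stddef.h>

void * load_best_cpu_backend(void) {
    // Hypothetical variant names, each built with a different feature set.
    const char * candidates[] = {
        "libggml-cpu-dotprod.so", // e.g. built with -march=armv8.2-a+dotprod
        "libggml-cpu-neon.so",    // baseline NEON build
        "libggml-cpu-generic.so", // portable fallback
    };

    void * best       = NULL;
    int    best_score = 0;

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        void * handle = dlopen(candidates[i], RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            continue; // this variant is not shipped
        }
        // Each variant reports, via runtime CPU-feature checks compiled
        // without the optional instructions, whether and how well it can run.
        int (*score)(void) = (int (*)(void)) dlsym(handle, "ggml_backend_score");
        int s = score ? score() : 0;
        if (s > best_score) {
            if (best) dlclose(best);
            best       = handle;
            best_score = s;
        } else {
            dlclose(handle);
        }
    }
    return best;
}
```

The key design point is that the code deciding which variant to keep must itself be safe to execute on the lowest-common-denominator CPU.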

@ggerganov (Owner) left a comment:

Nice work. Here are the results on M2 Ultra for 1.5B, 3B, and 7B IQ4_NL:

```
./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-iq4_nl.gguf -m ../models/qwen2.5-3b-coder/ggml-model-iq4_nl.gguf -m ../models/qwen2.5-7b-coder/ggml-model-iq4_nl.gguf -t 8,16 -p 1,2,4,8,16,256 -n 64 -fa 1
```

| model | size | backend | threads | fa | test | t/s (master) | t/s (this PR) | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp1 | 104.78 ± 4.57 | 149.73 ± 1.68 | 1.43 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp2 | 119.53 ± 0.78 | 187.11 ± 11.80 | 1.57 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp4 | 140.03 ± 1.08 | 349.12 ± 3.08 | 2.49 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp8 | 154.59 ± 1.64 | 416.07 ± 2.50 | 2.69 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp16 | 163.40 ± 0.37 | 460.18 ± 8.88 | 2.82 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp256 | 168.80 ± 0.23 | 505.59 ± 0.73 | 3.00 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | tg64 | 103.05 ± 0.13 | 146.19 ± 1.22 | 1.42 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp1 | 122.05 ± 35.35 | 139.92 ± 34.14 | 1.15 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp2 | 175.91 ± 0.17 | 233.74 ± 4.39 | 1.33 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp4 | 226.10 ± 1.81 | 468.70 ± 2.66 | 2.07 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp8 | 267.26 ± 1.15 | 603.36 ± 49.28 | 2.26 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp16 | 289.27 ± 0.84 | 771.76 ± 6.92 | 2.67 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp256 | 332.53 ± 0.27 | 946.58 ± 4.26 | 2.85 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | tg64 | 138.06 ± 1.81 | 147.33 ± 7.96 | 1.07 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp1 | 55.94 ± 1.54 | 86.27 ± 0.30 | 1.54 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp2 | 61.67 ± 0.23 | 105.29 ± 0.15 | 1.71 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp4 | 70.16 ± 0.20 | 187.99 ± 2.03 | 2.68 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp8 | 75.40 ± 0.21 | 213.79 ± 0.64 | 2.84 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp16 | 76.98 ± 0.13 | 230.31 ± 1.53 | 2.99 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp256 | 81.41 ± 0.44 | 249.87 ± 3.24 | 3.07 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | tg64 | 56.24 ± 0.05 | 85.54 ± 0.23 | 1.52 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp1 | 81.54 ± 0.12 | 87.91 ± 9.37 | 1.08 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp2 | 98.61 ± 1.32 | 132.32 ± 0.96 | 1.34 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp4 | 121.34 ± 0.45 | 260.49 ± 5.29 | 2.15 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp8 | 139.94 ± 0.34 | 349.79 ± 2.05 | 2.50 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp16 | 139.54 ± 0.39 | 415.31 ± 0.82 | 2.98 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp256 | 159.64 ± 0.46 | 469.05 ± 15.55 | 2.94 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | tg64 | 77.62 ± 1.99 | 89.08 ± 0.60 | 1.15 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp1 | 27.25 ± 1.46 | 45.50 ± 0.13 | 1.67 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp2 | 30.18 ± 0.04 | 50.24 ± 0.20 | 1.66 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp4 | 33.20 ± 1.18 | 94.53 ± 0.15 | 2.85 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp8 | 34.78 ± 0.12 | 103.33 ± 0.13 | 2.97 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp16 | 35.32 ± 0.03 | 106.36 ± 0.47 | 3.01 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp256 | 35.79 ± 0.13 | 113.61 ± 1.46 | 3.17 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | tg64 | 28.31 ± 0.03 | 44.94 ± 0.16 | 1.59 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp1 | 45.23 ± 0.16 | 48.33 ± 0.29 | 1.07 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp2 | 53.55 ± 0.05 | 64.34 ± 4.75 | 1.20 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp4 | 61.04 ± 0.19 | 156.32 ± 1.45 | 2.56 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp8 | 63.67 ± 2.90 | 175.07 ± 1.82 | 2.75 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp16 | 66.32 ± 0.57 | 190.85 ± 5.79 | 2.88 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp256 | 70.31 ± 0.03 | 212.75 ± 1.93 | 3.03 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | tg64 | 44.30 ± 0.44 | 47.26 ± 1.78 | 1.07 |

Perplexity for a few chunks seems OK:

```
./bin/llama-perplexity -m ../models/llama-3.1-8b/ggml-model-iq4_nl.gguf -f ../build/wikitext-2-raw/wiki.test.raw --chunks 4
```

@slaren slaren merged commit c202cef into ggerganov:master Nov 28, 2024
50 checks passed
```c
float * res_ptr = s;

for (int x = 0; x < nc / ncols_interleaved; x++) {
    const block_q4_0x4 * b_ptr = (const block_q4_0x4 *) vx + (x * nb);
```
A contributor commented on this excerpt:
Shouldn't this be a block_iq4_nlx4, not a block_q4_0x4?

FanShupei (Contributor, Author) replied:
You are right, this is a typo. Since the two structs happen to have the same layout, it's not a big problem. I'll open a follow-up PR to correct it.

```c
for (int y = 0; y < nr / 4; y++) {
    const block_q8_0x4 * a_ptr = (const block_q8_0x4 *) vy + (y * nb);
    for (int x = 0; x < nc / ncols_interleaved; x++) {
        const block_q4_0x4 * b_ptr = (const block_q4_0x4 *) vx + (x * nb);
```
A contributor commented on this excerpt as well:
Shouldn't this be a block_iq4_nlx4, not a block_q4_0x4?
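For reference, the fix discussed in these two review threads amounts to casting to the interleaved IQ4_NL type instead; here is a sketch of the corrected line, reusing the variable names from the excerpts above:

```c
// The repacked data at vx is interleaved IQ4_NL, not interleaved Q4_0.
// (The two structs happen to share the same layout, which is why the typo still worked.)
const block_iq4_nlx4 * b_ptr = (const block_iq4_nlx4 *) vx + (x * nb);
```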
