
ggml-cpu: support IQ4_NL_4_4 by runtime repack #10541

Merged (2 commits, Nov 28, 2024)

Conversation

FanShupei (Contributor)

Supersedes #10196.

Here I implement IQ4_NL runtime repack for the CPU backend. Currently it covers only IQ4_NL_4_4 for Arm NEON, implemented with intrinsics. If you are curious about where these intrinsics come from (and about many potential optimization opportunities), please see #10196 for more information; it contains a lengthy comparison between the intrinsics version and the original asm version.

I only support runtime repack and do not support llama-quantize, since, based on the discussion in #10196, online repack is the preferred flow. Online repack for IQ4_NL is significantly slower than for Q4_0, but I haven't done any rigorous measurements. Static quantization support could be added later if anyone really needs it.
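For context, "runtime repack" here means converting plain IQ4_NL blocks into an interleaved x4 layout at model load time, so that a 4x4 kernel can process four rows of the weight matrix at once with contiguous vector loads. Below is a minimal sketch of that interleaving idea; the struct definitions, field names, and the 4-byte group size are simplified assumptions for illustration, not the actual ggml code.

```c
#include <stdint.h>
#include <string.h>

#define QK4_NL 32

// A plain IQ4_NL block: one fp16 scale (stored as raw bits here) plus
// 32 packed 4-bit indices into the IQ4_NL non-linear value table.
typedef struct {
    uint16_t d;
    uint8_t  qs[QK4_NL / 2];
} block_iq4_nl_sketch;

// Four blocks from four consecutive rows, with the quant bytes interleaved
// so that one vector load fetches data for all four rows.
typedef struct {
    uint16_t d[4];
    uint8_t  qs[QK4_NL * 2];
} block_iq4_nlx4_sketch;

// Interleave in 4-byte groups: output group g takes 4 bytes from
// row (g % 4) at offset 4 * (g / 4).
static void repack_iq4_nl_x4(block_iq4_nlx4_sketch * dst, const block_iq4_nl_sketch src[4]) {
    for (int r = 0; r < 4; r++) {
        dst->d[r] = src[r].d;
    }
    for (int g = 0; g < (QK4_NL * 2) / 4; g++) {
        memcpy(dst->qs + 4 * g, src[g % 4].qs + 4 * (g / 4), 4);
    }
}
```

Doing this once at load time trades a small amount of extra work during loading for much better data layout in the matmul inner loop, which is where the speedups reported below come from.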

@FanShupei (Contributor, Author)

Performance Evaluation

It shows roughly a 3x speedup for IQ4_NL. Tested on a Mac M2 with GGML_METAL=off.

The previous PR (#10196) contains more evaluation results.

This PR

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 98.49 ± 0.82 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 97.96 ± 0.83 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 95.77 ± 0.10 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 34.92 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 186.77 ± 0.52 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 186.40 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 181.26 ± 0.08 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 61.11 ± 0.04 |

build: f56013d (4193)

Master

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 31.41 ± 0.16 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 31.51 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 31.12 ± 0.06 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 21.13 ± 0.01 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 61.33 ± 0.03 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 61.24 ± 0.00 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 60.58 ± 0.20 |
| llama 1B IQ4_NL - 4.5 bpw | 733.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 41.13 ± 0.03 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp64 | 127.61 ± 0.87 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp128 | 127.35 ± 0.11 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | pp256 | 122.99 ± 0.05 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | 1 | tg64 | 37.04 ± 0.01 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp64 | 247.89 ± 0.81 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp128 | 247.35 ± 0.15 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | pp256 | 238.69 ± 0.24 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | 1 | tg64 | 65.17 ± 0.04 |

build: 4a57d36 (4192)

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 27, 2024.
@slaren (Collaborator) left a comment:

The implementation looks good. I see a ~2x pp speedup on M3 Max and it doesn't seem to affect the load time too badly.

@FanShupei (Contributor, Author)

Copying my comment from #10196 here:

I worry it won't work as expected if we switch to intrinsics. If the features are not enabled at compile time, the intrinsics won't compile. If they are enabled at compile time, the compiler may introduce SIMD instructions into the base implementation through auto-vectorization. This is why the CI fails, but I currently have no idea how to fix it.

I'm afraid our current runtime dispatch mechanism doesn't actually work and no one really tests it. The original asm version also needs the dotprod feature, but it doesn't check for it...
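To make the failure mode concrete, here is a sketch of the usual gating pattern for a dot-product kernel, written as plain standalone C rather than the actual ggml code. The intrinsic path only exists when dotprod is enabled at compile time, and once it is enabled, the compiler is also free to auto-vectorize the scalar fallback with the same instructions, which is exactly the problem described above.

```c
// Illustrative sketch only; assumes n is a multiple of 16.
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
static int cpu_has_dotprod(void) {
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
}
#else
static int cpu_has_dotprod(void) {
    return 0; // placeholder: other platforms need their own detection
}
#endif

#if defined(__ARM_FEATURE_DOTPROD)
#include <arm_neon.h>
// This path only exists if dotprod was enabled at compile time.
static int32_t dot_i8_dotprod(const int8_t * a, const int8_t * b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 16) {
        acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
    }
    return vaddvq_s32(acc);
}
#endif

static int32_t dot_i8_scalar(const int8_t * a, const int8_t * b, int n) {
    // Caveat: with -march=...+dotprod the compiler may auto-vectorize this
    // loop using the very instructions the runtime check is meant to guard.
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t) a[i] * b[i];
    }
    return acc;
}

int32_t dot_i8(const int8_t * a, const int8_t * b, int n) {
#if defined(__ARM_FEATURE_DOTPROD)
    if (cpu_has_dotprod()) {
        return dot_i8_dotprod(a, b, n);
    }
#endif
    return dot_i8_scalar(a, b, n);
}
```

In other words, a runtime check alone is not enough: the build also has to produce a genuinely feature-free baseline, which is what the multi-variant backend idea below is meant to address.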

@slaren (Collaborator) commented Nov 27, 2024:

Yes, I agree; I was aware that this is an issue on x86. The goal for x86 is to bundle multiple versions of the CPU backend for the different instruction sets as dynamic libraries and load the best one at startup. We should probably do the same for ARM; in addition to what you mention, the current mechanism is incomplete and not every function that uses the features checks for them at runtime.
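As a rough illustration of that plan, here is a sketch of a startup loader that picks the most capable CPU backend variant. The library names and the exported `ggml_backend_score()` symbol are assumptions for illustration only, not the actual ggml loading mechanism.

```c
// Sketch of "load the best CPU backend variant at startup" (POSIX dlopen).
#include <dlfcn.h>
#include <stddef.h>

void * load_best_cpu_backend(void) {
    // Hypothetical variant names, each built with a different feature set.
    const char * candidates[] = {
        "libggml-cpu-dotprod.so", // e.g. built with -march=armv8.2-a+dotprod
        "libggml-cpu-neon.so",    // baseline NEON build
        "libggml-cpu-generic.so", // portable fallback
    };

    void * best       = NULL;
    int    best_score = 0;

    for (size_t i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        void * handle = dlopen(candidates[i], RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            continue; // this variant is not shipped
        }
        // Each variant reports, via runtime CPU-feature checks compiled
        // without the optional instructions, whether and how well it can run.
        int (*score)(void) = (int (*)(void)) dlsym(handle, "ggml_backend_score");
        int s = score ? score() : 0;
        if (s > best_score) {
            if (best) dlclose(best);
            best       = handle;
            best_score = s;
        } else {
            dlclose(handle);
        }
    }
    return best;
}
```

The key design point is that the code deciding which variant to keep must itself be safe to execute on the lowest-common-denominator CPU.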

@ggerganov (Owner) left a comment:

Nice work. Here are the results on M2 Ultra for 1.5B, 3B, and 7B IQ4_NL:

```
./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-iq4_nl.gguf -m ../models/qwen2.5-3b-coder/ggml-model-iq4_nl.gguf -m ../models/qwen2.5-7b-coder/ggml-model-iq4_nl.gguf -t 8,16 -p 1,2,4,8,16,256 -n 64 -fa 1
```

| model | size | backend | threads | fa | test | t/s (master) | t/s (this PR) | speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp1 | 104.78 ± 4.57 | 149.73 ± 1.68 | 1.43 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp2 | 119.53 ± 0.78 | 187.11 ± 11.80 | 1.57 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp4 | 140.03 ± 1.08 | 349.12 ± 3.08 | 2.49 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp8 | 154.59 ± 1.64 | 416.07 ± 2.50 | 2.69 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp16 | 163.40 ± 0.37 | 460.18 ± 8.88 | 2.82 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | pp256 | 168.80 ± 0.23 | 505.59 ± 0.73 | 3.00 |
| qwen2 1.5B | 892.20 MiB | CPU | 8 | 1 | tg64 | 103.05 ± 0.13 | 146.19 ± 1.22 | 1.42 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp1 | 122.05 ± 35.35 | 139.92 ± 34.14 | 1.15 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp2 | 175.91 ± 0.17 | 233.74 ± 4.39 | 1.33 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp4 | 226.10 ± 1.81 | 468.70 ± 2.66 | 2.07 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp8 | 267.26 ± 1.15 | 603.36 ± 49.28 | 2.26 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp16 | 289.27 ± 0.84 | 771.76 ± 6.92 | 2.67 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | pp256 | 332.53 ± 0.27 | 946.58 ± 4.26 | 2.85 |
| qwen2 1.5B | 892.20 MiB | CPU | 16 | 1 | tg64 | 138.06 ± 1.81 | 147.33 ± 7.96 | 1.07 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp1 | 55.94 ± 1.54 | 86.27 ± 0.30 | 1.54 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp2 | 61.67 ± 0.23 | 105.29 ± 0.15 | 1.71 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp4 | 70.16 ± 0.20 | 187.99 ± 2.03 | 2.68 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp8 | 75.40 ± 0.21 | 213.79 ± 0.64 | 2.84 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp16 | 76.98 ± 0.13 | 230.31 ± 1.53 | 2.99 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | pp256 | 81.41 ± 0.44 | 249.87 ± 3.24 | 3.07 |
| qwen2 3B | 1.70 GiB | CPU | 8 | 1 | tg64 | 56.24 ± 0.05 | 85.54 ± 0.23 | 1.52 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp1 | 81.54 ± 0.12 | 87.91 ± 9.37 | 1.08 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp2 | 98.61 ± 1.32 | 132.32 ± 0.96 | 1.34 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp4 | 121.34 ± 0.45 | 260.49 ± 5.29 | 2.15 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp8 | 139.94 ± 0.34 | 349.79 ± 2.05 | 2.50 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp16 | 139.54 ± 0.39 | 415.31 ± 0.82 | 2.98 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | pp256 | 159.64 ± 0.46 | 469.05 ± 15.55 | 2.94 |
| qwen2 3B | 1.70 GiB | CPU | 16 | 1 | tg64 | 77.62 ± 1.99 | 89.08 ± 0.60 | 1.15 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp1 | 27.25 ± 1.46 | 45.50 ± 0.13 | 1.67 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp2 | 30.18 ± 0.04 | 50.24 ± 0.20 | 1.66 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp4 | 33.20 ± 1.18 | 94.53 ± 0.15 | 2.85 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp8 | 34.78 ± 0.12 | 103.33 ± 0.13 | 2.97 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp16 | 35.32 ± 0.03 | 106.36 ± 0.47 | 3.01 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | pp256 | 35.79 ± 0.13 | 113.61 ± 1.46 | 3.17 |
| qwen2 7B | 4.15 GiB | CPU | 8 | 1 | tg64 | 28.31 ± 0.03 | 44.94 ± 0.16 | 1.59 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp1 | 45.23 ± 0.16 | 48.33 ± 0.29 | 1.07 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp2 | 53.55 ± 0.05 | 64.34 ± 4.75 | 1.20 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp4 | 61.04 ± 0.19 | 156.32 ± 1.45 | 2.56 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp8 | 63.67 ± 2.90 | 175.07 ± 1.82 | 2.75 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp16 | 66.32 ± 0.57 | 190.85 ± 5.79 | 2.88 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | pp256 | 70.31 ± 0.03 | 212.75 ± 1.93 | 3.03 |
| qwen2 7B | 4.15 GiB | CPU | 16 | 1 | tg64 | 44.30 ± 0.44 | 47.26 ± 1.78 | 1.07 |

Perplexity for a few chunks seems OK:

```
./bin/llama-perplexity -m ../models/llama-3.1-8b/ggml-model-iq4_nl.gguf -f ../build/wikitext-2-raw/wiki.test.raw --chunks 4
```

@slaren slaren merged commit c202cef into ggerganov:master Nov 28, 2024
50 checks passed
```c
float * res_ptr = s;

for (int x = 0; x < nc / ncols_interleaved; x++) {
    const block_q4_0x4 * b_ptr = (const block_q4_0x4 *) vx + (x * nb);
```
A contributor commented on this excerpt:
Shouldn't this be a block_iq4_nlx4, not a block_q4_0x4?

FanShupei (Contributor, Author) replied:
You are right, this is a typo. Since the two structs happen to have the same layout, it's not a big problem. I'll open a follow-up PR to correct it.

```c
for (int y = 0; y < nr / 4; y++) {
    const block_q8_0x4 * a_ptr = (const block_q8_0x4 *) vy + (y * nb);
    for (int x = 0; x < nc / ncols_interleaved; x++) {
        const block_q4_0x4 * b_ptr = (const block_q4_0x4 *) vx + (x * nb);
```
A contributor commented on this excerpt as well:
Shouldn't this be a block_iq4_nlx4, not a block_q4_0x4?
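For reference, the fix discussed in these two review threads amounts to casting to the interleaved IQ4_NL type instead; here is a sketch of the corrected line, reusing the variable names from the excerpts above:

```c
// The repacked data at vx is interleaved IQ4_NL, not interleaved Q4_0.
// (The two structs happen to share the same layout, which is why the typo still worked.)
const block_iq4_nlx4 * b_ptr = (const block_iq4_nlx4 *) vx + (x * nb);
```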
