sycl : offload of get_rows set to false #10432

Merged · 1 commit merged into ggerganov:master on Nov 29, 2024

Conversation

@Alcpz Alcpz (Collaborator) commented Nov 20, 2024


#10133 changed the get_rows offload from false to true. I've detected a big regression for quantizations that support get_rows (llama3 Q8_0, for example).

@uniartisan Could you share more information about the device you used for offloading (where you saw increased performance)? Or did this change just improve testing?

An example of regression:

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 1340.34 ± 21.74 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 88.64 ± 0.05 |

build: fab5d30 (4143)

With this revert:

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 5777.93 ± 26.32 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 89.31 ± 0.03 |

build: f4c4ce3
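
For reference, a minimal sketch of what the single-commit revert amounts to in the SYCL backend's offload decision. The function name, signature, and file layout here are assumptions based on how ggml backends typically gate offloading, not a verbatim copy of the diff:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Illustrative only: the real function lives in the SYCL backend (ggml-sycl) and its
// exact name and signature may differ between revisions.
static bool ggml_backend_sycl_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    GGML_UNUSED(dev);

    const int min_batch_size = 32;

    // Excluding GGML_OP_GET_ROWS keeps that op on the CPU, so the large token
    // embedding tensor no longer has to be copied to VRAM.
    return op->ne[1] >= min_batch_size && op->op != GGML_OP_GET_ROWS;
}
```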

@slaren slaren (Collaborator) commented Nov 20, 2024

Returning true for GGML_OP_GET_ROWS in offload_op will cause the token embeddings to be copied to VRAM, which is almost never worth it, since this is a big tensor and the op can be run very cheaply on the CPU. I imagine that RWKV uses get_rows in some way that makes copying the weight to VRAM worthwhile in that case, and that's why @uniartisan saw a speedup, but it needs to be done in a more selective way.
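
One hypothetical way to make this selective (not part of this PR; the helper name and the 64 MiB cutoff are invented for the example) would be to gate the GET_ROWS offload on the size of the tensor that would have to be copied to VRAM:

```cpp
#include "ggml.h"

// Hypothetical helper, not from the PR: only report GET_ROWS as worth offloading when
// the table rows are gathered from is small enough that copying it to device memory
// is likely to pay off.
static bool sycl_should_offload_get_rows(const ggml_tensor * op) {
    GGML_ASSERT(op->op == GGML_OP_GET_ROWS);

    // src[0] is the table being indexed (e.g. the token embedding matrix).
    const size_t src_bytes = ggml_nbytes(op->src[0]);

    // A several-GiB embedding table fails this check and stays on the CPU; a small
    // per-layer table (as RWKV-style models may use) could still be offloaded.
    const size_t max_copy_bytes = 64u * 1024 * 1024;  // illustrative 64 MiB cutoff
    return src_bytes <= max_copy_bytes;
}
```

In practice the right cutoff, and whether the weight stays resident in VRAM between evaluations, would need to be measured on the affected devices.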

@github-actions github-actions bot added the SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language label Nov 20, 2024
@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented Nov 21, 2024

@Alcpz
Which GPU did you test with?

PR #10133 has no impact on the Intel Arc 770 for llama2-7b-q4 and Meta-Llama-3-8B.Q8_0.gguf.

@Alcpz Alcpz (Collaborator, Author) commented Nov 21, 2024

I've tested multiple GPUs. The description has data for an Nvidia A100, but I also tested on an Arc 770 and a Data Center GPU Max 1100. On both of these I see a performance regression, though I'm using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf.

See additional performance information below:


| ID | Device Type | Name | Version | compute units | work group | sub group size | global mem size | Driver version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [level_zero:gpu:0] | Intel Data Center GPU Max 1100 | 12.60 | 448 | 1024 | 32 | 51539M | 1.3.30049+10 |

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 1204.45 ± 6.63 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 21.83 ± 0.04 |

build: fab5d30 (4143)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 3228.17 ± 26.07 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 21.82 ± 0.02 |

build: f4c4ce3 (this PR)


| ID | Device Type | Name | Version | compute units | work group | sub group size | global mem size | Driver version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [level_zero:gpu:0] | Intel Arc A770 Graphics | 12.55 | 512 | 1024 | 32 | 16225M | 1.3.30049+10 |

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 883.37 ± 1.00 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 14.99 ± 0.00 |

build: fab5d30 (4143)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 1288.13 ± 7.94 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 14.98 ± 0.00 |

build: f4c4ce3 (this PR)

@airMeng airMeng (Collaborator) commented Nov 21, 2024

@NeoZhangJianyu do you mean there is no regression during the decoding phase?

@Alcpz Alcpz changed the title from "sycl : offload of get_rows set to 0" to "sycl : offload of get_rows set to false" on Nov 25, 2024
@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented

> @NeoZhangJianyu do you mean there is no regression during the decoding phase?

I just tested the model files end to end and did not find any performance change.

@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented Nov 29, 2024

@uniartisan What do you think, since you are the author of PR #10133?

@slaren slaren (Collaborator) commented Nov 29, 2024

@NeoZhangJianyu I assure you, this is a significant performance problem and needs to be fixed as soon as possible. It's hard to tell why you cannot reproduce this without more details about how you are testing.

@Rbiessy Rbiessy (Contributor) commented Nov 29, 2024

@NeoZhangJianyu you mentioned testing with Meta-Llama-3-8B.Q8_0.gguf while we are using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. Could that explain why you are not seeing the same performance drop?

@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented

My test ignored the impact on "pp512".

@NeoZhangJianyu NeoZhangJianyu merged commit 0f77aae into ggerganov:master Nov 29, 2024
54 checks passed