sycl : offload of get_rows set to false #10432

Merged · 1 commit merged into ggerganov:master on Nov 29, 2024

Conversation

@Alcpz Alcpz (Collaborator) commented Nov 20, 2024


#10133 changed the get_rows offload from false to true. I've detected a big regression for quantizations that support get_rows (llama3 Q8_0, for example).

@uniartisan Could you share more information about the device you used for offloading (where you saw increased performance)? Or did this change just improve testing?

An example of regression:

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 1340.34 ± 21.74 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 88.64 ± 0.05 |

build: fab5d30 (4143)

With this revert:

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 5777.93 ± 26.32 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 89.31 ± 0.03 |

build: f4c4ce3
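
For reference, a minimal sketch of what the single-commit revert amounts to in the SYCL backend's offload decision. The function name, signature, and file layout here are assumptions based on how ggml backends typically gate offloading, not a verbatim copy of the diff:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Illustrative only: the real function lives in the SYCL backend (ggml-sycl) and its
// exact name and signature may differ between revisions.
static bool ggml_backend_sycl_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    GGML_UNUSED(dev);

    const int min_batch_size = 32;

    // Excluding GGML_OP_GET_ROWS keeps that op on the CPU, so the large token
    // embedding tensor no longer has to be copied to VRAM.
    return op->ne[1] >= min_batch_size && op->op != GGML_OP_GET_ROWS;
}
```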

@slaren slaren (Collaborator) commented Nov 20, 2024

Returning true for GGML_OP_GET_ROWS in offload_op will cause the token embeddings to be copied to VRAM, which is almost never worth it, since this is a big tensor and the op can be run very cheaply on the CPU. I imagine that RWKV uses get_rows in some way that makes copying the weight to VRAM worthwhile in that case, and that's why @uniartisan saw a speedup, but it needs to be done in a more selective way.
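
One hypothetical way to make this selective (not part of this PR; the helper name and the 64 MiB cutoff are invented for the example) would be to gate the GET_ROWS offload on the size of the tensor that would have to be copied to VRAM:

```cpp
#include "ggml.h"

// Hypothetical helper, not from the PR: only report GET_ROWS as worth offloading when
// the table rows are gathered from is small enough that copying it to device memory
// is likely to pay off.
static bool sycl_should_offload_get_rows(const ggml_tensor * op) {
    GGML_ASSERT(op->op == GGML_OP_GET_ROWS);

    // src[0] is the table being indexed (e.g. the token embedding matrix).
    const size_t src_bytes = ggml_nbytes(op->src[0]);

    // A several-GiB embedding table fails this check and stays on the CPU; a small
    // per-layer table (as RWKV-style models may use) could still be offloaded.
    const size_t max_copy_bytes = 64u * 1024 * 1024;  // illustrative 64 MiB cutoff
    return src_bytes <= max_copy_bytes;
}
```

In practice the right cutoff, and whether the weight stays resident in VRAM between evaluations, would need to be measured on the affected devices.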

@github-actions github-actions bot added the SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language label Nov 20, 2024
@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented Nov 21, 2024

@Alcpz
Which GPU did you test with?

PR #10133 has no impact on the Intel Arc 770 for llama2-7b-q4 and Meta-Llama-3-8B.Q8_0.gguf.

@Alcpz Alcpz (Collaborator, Author) commented Nov 21, 2024

I've tested multiple GPUs. The description has data for an Nvidia A100, but I also tested on an Arc 770 and a Data Center GPU Max 1100. On both of these I see a performance regression, though I'm using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf.

See additional performance information below:


| ID | Device Type | Name | Version | compute units | work group | sub group size | global mem size | Driver version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [level_zero:gpu:0] | Intel Data Center GPU Max 1100 | 12.60 | 448 | 1024 | 32 | 51539M | 1.3.30049+10 |

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 1204.45 ± 6.63 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 21.83 ± 0.04 |

build: fab5d30 (4143)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 3228.17 ± 26.07 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 21.82 ± 0.02 |

build: f4c4ce3 (this PR)


| ID | Device Type | Name | Version | compute units | work group | sub group size | global mem size | Driver version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | [level_zero:gpu:0] | Intel Arc A770 Graphics | 12.55 | 512 | 1024 | 32 | 16225M | 1.3.30049+10 |

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 883.37 ± 1.00 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 14.99 ± 0.00 |

build: fab5d30 (4143)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | pp512 | 1288.13 ± 7.94 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 16 | none | 0 | tg128 | 14.98 ± 0.00 |

build: f4c4ce3 (this PR)

@airMeng airMeng (Collaborator) commented Nov 21, 2024

@NeoZhangJianyu do you mean there is no regression during the decoding phase?

@Alcpz Alcpz changed the title from "sycl : offload of get_rows set to 0" to "sycl : offload of get_rows set to false" on Nov 25, 2024
@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented

> @NeoZhangJianyu do you mean there is no regression during the decoding phase?

I just tested the model files end to end and did not find any performance change.

@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented Nov 29, 2024

@uniartisan What do you think, since you are the author of PR #10133?

@slaren slaren (Collaborator) commented Nov 29, 2024

@NeoZhangJianyu I assure you, this is a significant performance problem and needs to be fixed as soon as possible. It's hard to tell why you cannot reproduce this without more details about how you are testing.

@Rbiessy Rbiessy (Contributor) commented Nov 29, 2024

@NeoZhangJianyu you mentioned testing with Meta-Llama-3-8B.Q8_0.gguf while we are using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. Could that explain why you are not seeing the same performance drop?

@NeoZhangJianyu NeoZhangJianyu (Collaborator) commented

My test ignored the impact on "pp512".

@NeoZhangJianyu NeoZhangJianyu merged commit 0f77aae into ggerganov:master Nov 29, 2024
54 checks passed