sycl : offload of get_rows set to false #10432
Conversation
I've tested multiple GPUs. The description has data for an Nvidia A100, but I also tested on an Arc 770 and a Data Center GPU Max 1100. For these two GPUs I also see a regression in performance. See below for additional performance information:
build: fab5d30 (4143)
build: f4c4ce3 (this PR)
build: fab5d30 (4143)
build: f4c4ce3 (this PR)
@NeoZhangJianyu do you mean no regression during the decoding phase?
I just tested the models end to end.
@uniartisan
@NeoZhangJianyu I assure you, this is a significant performance problem and needs to be fixed as soon as possible. It's hard to tell why you cannot reproduce this without more details about how you are testing.
@NeoZhangJianyu you mentioned testing with
My test ignores the impact on "pp512".
#10133 changed the get_rows offload from false to true. I've detected a big regression for quantizations that support get_rows (llama3 Q8_0, for example).
@uniartisan Could you share more information about the device you used for offloading (where you saw increased performance)? Or did this only improve your tests?
An example of regression:
build: fab5d30 (4143)
With this revert:
build: f4c4ce3