
Implementation of Q6_K FloatTensor #12

Open · wants to merge 1 commit into base: main
Conversation

srogmann (Contributor)
This PR contains a Q6_K implementation.

  • Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q6_K
  • CPU: AMD Ryzen 9 7900X
  • JVM: OpenJDK 64-Bit Server VM
  • Linux: 6.9.7-arch1-1

| Quant | Species | Speed |
| ----- | ------- | ----- |
| Q6_K | S_128_BIT | 0.22 tokens/s |
| Q6_K | S_256_BIT (non-array) | 0.47 tokens/s, 0.10 tokens/s |
| Q6_K | S_256_BIT (array) | 1.26 tokens/s |
| Q6_K | S_256_BIT (512 bits) | 0.29 tokens/s |

  • Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, Q8

| Quant | Species | Speed |
| ----- | ------- | ----- |
| Q8_0 | S_128_BIT | 4.02 tokens/s |
| Q8_0 | S_256_BIT | 5.80 tokens/s |
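For readers unfamiliar with the format, here is a minimal scalar sketch of Q6_K dequantization for one 256-value super-block, following the GGUF/llama.cpp Q6_K layout (128 bytes of lower 4-bit quants, 64 bytes of upper 2-bit quants, 16 signed 8-bit sub-block scales, and one fp16 super-block scale). The class, method, and parameter names are illustrative and not this PR's actual FloatTensor API:

```java
// Hedged sketch: scalar dequantization of a single Q6_K super-block (256 values).
// Names and the byte[]-based signature are assumptions for illustration only.
final class Q6KSketch {
    static final int QK_K = 256; // values per super-block

    // ql: 128 bytes (lower 4 bits), qh: 64 bytes (upper 2 bits),
    // scales: 16 signed bytes, d: super-block scale already converted from fp16.
    static float[] dequantizeBlock(byte[] ql, byte[] qh, byte[] scales, float d) {
        float[] out = new float[QK_K];
        for (int n = 0; n < QK_K; n += 128) {       // two halves of 128 values
            int qlOff = n / 2, qhOff = n / 4, scOff = n / 16;
            for (int l = 0; l < 32; l++) {
                int is = l / 16;                    // sub-block index within this half
                int loA = ql[qlOff + l] & 0x0F;
                int loB = ql[qlOff + l + 32] & 0x0F;
                int hiA = (ql[qlOff + l] & 0xFF) >>> 4;
                int hiB = (ql[qlOff + l + 32] & 0xFF) >>> 4;
                int h = qh[qhOff + l] & 0xFF;
                // combine 4 low bits with 2 high bits, then re-center to [-32, 31]
                int q1 = (loA | ((h & 3) << 4)) - 32;
                int q2 = (loB | (((h >> 2) & 3) << 4)) - 32;
                int q3 = (hiA | (((h >> 4) & 3) << 4)) - 32;
                int q4 = (hiB | (((h >> 6) & 3) << 4)) - 32;
                out[n + l]      = d * scales[scOff + is]     * q1;
                out[n + l + 32] = d * scales[scOff + is + 2] * q2;
                out[n + l + 64] = d * scales[scOff + is + 4] * q3;
                out[n + l + 96] = d * scales[scOff + is + 6] * q4;
            }
        }
        return out;
    }
}
```

The S_128_BIT/S_256_BIT variants benchmarked above presumably perform the same unpack-and-scale work with the Vector API instead of this scalar loop.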

@mukel (Owner) commented Aug 12, 2024

I experimented with running this on a patched Graal compiler with partial Vector API support. I focused on vectorDot256 because it is the most likely to be compiled properly. I got quite far: everything is compiled properly until the last large block with the sums, where I get an exception in the compiler.
The bug seems to be in the compiler's internal tracking of the vectors, not in missing features. I believe that, with minor fixes, Graal will be able to compile this properly. I'll keep you posted.
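For reference, this is the general shape of a 256-bit Vector API accumulation loop of the kind the compiler has to handle here. It is a hedged sketch over plain float[] inputs; the PR's actual vectorDot256 additionally unpacks Q6_K quants inside the loop:

```java
// Hedged sketch of a 256-bit Vector API dot product; not the PR's vectorDot256.
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorDotSketch {
    static final VectorSpecies<Float> S_256_BIT = FloatVector.SPECIES_256;

    static float dot(float[] a, float[] b, int size) {
        FloatVector acc = FloatVector.zero(S_256_BIT);
        int upper = S_256_BIT.loopBound(size);
        int i = 0;
        for (; i < upper; i += S_256_BIT.length()) {
            FloatVector va = FloatVector.fromArray(S_256_BIT, a, i);
            FloatVector vb = FloatVector.fromArray(S_256_BIT, b, i);
            acc = va.fma(vb, acc); // fused multiply-add into the accumulator
        }
        float result = acc.reduceLanes(VectorOperators.ADD); // single cross-lane sum
        for (; i < size; i++) {  // scalar tail for the remaining elements
            result += a[i] * b[i];
        }
        return result;
    }
}
```

Keeping the cross-lane reduction (reduceLanes) outside the hot loop is usually the part that compilers with partial Vector API support handle best; the per-iteration body stays purely lanewise.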

@srogmann (Contributor, Author)

Did you try vectorDot256Array?

@mukel mentioned this pull request on Oct 23, 2024