
GEMM API for efficient LLM inference with W8A16 #1788

Open
oleotiger opened this issue Jan 20, 2024 · 3 comments
Labels
enhancement (A feature or an optimization request) · help wanted · platform:cpu-aarch64 (Codeowner: @oneapi-src/onednn-cpu-aarch64)

Comments

@oleotiger

I want to run inference of a quantized LLaMA model (W8A16) on Armv9 (with SVE) using oneDNN. The LLaMA weights are quantized per group.

Based on my understanding, I need to prepack the weights to avoid the cost of repeated packing. However, packing disrupts the layout of the per-group quantization scales and zero points. I understand that dequantization needs to be fused into the kernel; if it were fused into packing instead, that would amount to storing a second copy of the weights in FP16, which undoes the quantization.

I haven't figured out how to combine prepacking with per-group dequantization.

1. Which interface should I use for prepacking (a sketch of the idiom I have in mind is below)? SVE vector registers can be, for example, 256 or 512 bits wide; how does oneDNN choose the packing layout accordingly?
2. After prepacking and saving the weights, how do I fuse dequantization into the kernel during computation?
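
For concreteness, the prepacking idiom I have in mind is the usual `format_tag::any` plus one-time reorder pattern. This is only a sketch: the shapes, the f16 data types, and whether this path is optimized on my target are my own assumptions, not a verified recipe.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    const memory::dim M = 1, K = 4096, N = 4096; // placeholder shapes

    // Plain row-major descriptors for activations and output.
    auto src_md = memory::desc({M, K}, memory::data_type::f16,
                               memory::format_tag::ab);
    auto dst_md = memory::desc({M, N}, memory::data_type::f16,
                               memory::format_tag::ab);
    // format_tag::any lets the implementation pick the packed weight
    // layout best suited to the target ISA (e.g. the SVE vector length).
    auto wei_md = memory::desc({K, N}, memory::data_type::f16,
                               memory::format_tag::any);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md);

    // One-time reorder from the user's plain layout into the layout the
    // primitive actually wants; packed_wei is then reused on every call.
    auto user_wei_md = memory::desc({K, N}, memory::data_type::f16,
                                    memory::format_tag::ab);
    memory user_wei(user_wei_md, eng);
    memory packed_wei(pd.weights_desc(), eng);
    reorder(user_wei, packed_wei).execute(s, user_wei, packed_wei);
    s.wait();
    return 0;
}
```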

@vpirogov vpirogov self-assigned this Jan 23, 2024
@vpirogov (Member)

@oleotiger, we are working on enabling per-group quantization in oneDNN. You can find a description of the proposed design for fused weight decompression here. The implementation is not yet available for any platform, though. The only option for now is to decompress the weights separately, as you indicated.
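
For illustration, "decompress separately" amounts to a plain per-group dequantization pass over the weights before the matmul. A minimal sketch, assuming groups run along K and a row-major K×N layout; all names, the layout, and the `_Float16` type (a GCC/Clang extension on AArch64) are placeholder assumptions:

```cpp
#include <cstdint>

// Dequantize int8 weights to f16: each group of `group_size`
// consecutive rows along K shares one scale and one zero point.
// Layout assumption: wq is K x N row-major; scales and zps are
// (K / group_size) x N.
void dequantize_w8_to_f16(const std::int8_t *wq, const float *scales,
                          const std::int32_t *zps, _Float16 *w,
                          std::int64_t K, std::int64_t N,
                          std::int64_t group_size) {
    for (std::int64_t k = 0; k < K; ++k) {
        const std::int64_t g = k / group_size; // group index along K
        for (std::int64_t n = 0; n < N; ++n) {
            w[k * N + n] = static_cast<_Float16>(
                    (wq[k * N + n] - zps[g * N + n]) * scales[g * N + n]);
        }
    }
}
```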

@vpirogov (Member)

+@igorsafo

@vpirogov (Member) commented Feb 1, 2024

The API and validation changes necessary to support W8A16 quantization have landed in the main and rls-v3.4 branches. The specifics are covered in the GPT Quantization RFC.
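
For reference, the intended usage looks roughly like this. Treat it as a sketch based on the RFC rather than a verified recipe: the shapes, group size, mask, and availability on any particular platform are assumptions.

```cpp
#include <dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    const memory::dim M = 1, K = 4096, N = 4096; // placeholder shapes
    const memory::dim group_size = 128;          // placeholder group size

    auto src_md = memory::desc({M, K}, memory::data_type::f16,
                               memory::format_tag::ab);
    auto wei_md = memory::desc({K, N}, memory::data_type::s8,
                               memory::format_tag::any);
    auto dst_md = memory::desc({M, N}, memory::data_type::f16,
                               memory::format_tag::ab);

    primitive_attr attr;
    // Ask the library to up-convert the integer weights to the f16
    // compute type (weight-only quantization / decompression).
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);
    // Per-group scales on the weights: groups of `group_size` along K,
    // per-channel along N (mask bits 0 and 1 cover both dimensions).
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
                    {group_size, 1}, memory::data_type::f16);

    matmul::primitive_desc pd(eng, src_md, wei_md, dst_md, attr);
    // At execution time the scale values are passed through
    // DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS.
    return 0;
}
```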

+@jondea, @milpuz01 for additional comments on Arm specifics.

@vpirogov vpirogov added enhancement A feature or an optimization request and removed question labels Feb 1, 2024
@vpirogov vpirogov added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Mar 29, 2024
@vpirogov vpirogov removed their assignment Jul 16, 2024