GEMM API for efficient LLM inference with W8A16 #1788
Labels: enhancement (a feature or an optimization request), help wanted, platform:cpu-aarch64 (codeowner: @oneapi-src/onednn-cpu-aarch64)
I want to run inference on a quantized LLaMA model (W8A16) on Armv9 (with SVE) using oneDNN. The LLaMA weights are quantized per group.
As I understand it, I need to prepack the weights once to avoid the cost of repacking on every call. However, packing rearranges the weights and therefore breaks the correspondence with the per-group quantization scales and shifts. Dequantization therefore has to be fused into the compute kernel: if it were folded into the packing step instead, I would end up storing a second copy of the weights in FP16, which defeats the purpose of quantization.
I haven't figured out how to combine prepacking and per-group dequantization.
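To make the question concrete, here is roughly what I imagine the fused-dequantization side looking like, using the matmul primitive's grouped weight scales / zero-points attributes together with the fpmath mode (this is only a sketch against a recent oneDNN API; the group size, shapes and data types are placeholders, and I'm not sure whether this path is supported on aarch64):

```cpp
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

// Placeholder problem sizes: (K x N) int8 weights, quantized per group of
// G consecutive values along K; activations and output are f16.
constexpr memory::dim K = 4096, N = 4096, G = 128;

primitive_attr make_w8a16_attr() {
    primitive_attr attr;
    // One f16 scale per (G x 1) block of the (K x N) weight matrix:
    // mask (1<<0)+(1<<1) says scales vary along K and N, and the groups
    // argument {G, 1} collapses K into K/G groups.
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {G, 1},
            memory::data_type::f16);
    // Per-group zero points (shifts) with the same grouping, stored as s8.
    attr.set_zero_points(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {G, 1},
            memory::data_type::s8);
    // Request weight decompression: the int8 weights are up-converted to
    // f16 inside the kernel, so no dequantized copy is ever materialized.
    attr.set_fpmath_mode(fpmath_mode::f16, true);
    return attr;
}
```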
Which API should I use for prepacking? SVE vectors can be 256 or 512 bits wide; how does oneDNN choose the packed layout for the target vector length?
After prepacking the weights and saving them, how do I fuse dequantization into the kernel at compute time?
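For the prepacking part, this is the flow I have in mind, continuing from the attribute sketch above: `format_tag::any` lets the implementation pick its preferred blocked weight layout, and a one-time reorder does the packing, while the scales and zero points stay in plain buffers that are passed at execute time. Again just a sketch with placeholder shapes, not something I have verified on aarch64:

```cpp
#include <unordered_map>
#include <vector>
#include <oneapi/dnnl/dnnl.hpp>
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Placeholder shapes: (M x K) f16 activations, (K x N) s8 weights,
    // per-group quantization with group size G along K.
    const memory::dim M = 32, K = 4096, N = 4096, G = 128;

    // Same grouped-scales / zero-points / fpmath attributes as sketched above.
    primitive_attr attr;
    attr.set_scales(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {G, 1},
            memory::data_type::f16);
    attr.set_zero_points(DNNL_ARG_WEIGHTS, (1 << 0) + (1 << 1), {G, 1},
            memory::data_type::s8);
    attr.set_fpmath_mode(fpmath_mode::f16, true);

    // format_tag::any lets the implementation choose its preferred blocked
    // weight layout for the target ISA (e.g. depending on the SVE vector
    // length), instead of me hard-coding a packing scheme.
    auto src_md = memory::desc({M, K}, memory::data_type::f16, memory::format_tag::ab);
    auto wei_md = memory::desc({K, N}, memory::data_type::s8, memory::format_tag::any);
    auto dst_md = memory::desc({M, N}, memory::data_type::f16, memory::format_tag::ab);
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);

    // One-time prepacking: reorder the plain row-major int8 weights into the
    // layout the primitive asked for. Only the int8 values are shuffled; the
    // scales and zero points stay in their own plain buffers and are applied
    // at execute time, so packing does not disturb the per-group metadata.
    std::vector<int8_t> wei_plain(K * N); // quantized LLaMA weights go here
    auto wei_plain_md = memory::desc({K, N}, memory::data_type::s8, memory::format_tag::ab);
    auto wei_plain_mem = memory(wei_plain_md, eng, wei_plain.data());
    auto wei_packed_mem = memory(pd.weights_desc(), eng);
    reorder(wei_plain_mem, wei_packed_mem)
            .execute(strm, wei_plain_mem, wei_packed_mem);
    strm.wait();
    // wei_packed_mem can now be kept and reused for every matmul call.

    // Per-call execution: scales and zero points are passed as runtime
    // arguments in plain (K/G x N) layout.
    auto src_mem = memory(pd.src_desc(), eng);
    auto dst_mem = memory(pd.dst_desc(), eng);
    auto scales_md = memory::desc({K / G, N}, memory::data_type::f16, memory::format_tag::ab);
    auto zps_md = memory::desc({K / G, N}, memory::data_type::s8, memory::format_tag::ab);
    auto scales_mem = memory(scales_md, eng);
    auto zps_mem = memory(zps_md, eng);

    std::unordered_map<int, memory> args{
            {DNNL_ARG_SRC, src_mem},
            {DNNL_ARG_WEIGHTS, wei_packed_mem},
            {DNNL_ARG_DST, dst_mem},
            {DNNL_ARG_ATTR_SCALES | DNNL_ARG_WEIGHTS, scales_mem},
            {DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS, zps_mem}};
    matmul(pd).execute(strm, args);
    strm.wait();
    return 0;
}
```

If packing only permutes the int8 payload and the scales/shifts are runtime arguments, then prepacking and per-group dequantization would not conflict at all; that is what I am hoping to confirm, along with whether this works with the aarch64/SVE backend.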