
Why does W8A8 run much slower and take more GPU memory than fp16? #3

Open
Leo-yang-1020 opened this issue Jul 19, 2024 · 8 comments

@Leo-yang-1020

When trying to reproduce your code, we find that when running inference with the default fp16, the peak memory is about 9800 MB (screenshot attached).
But when running inference with W8A8 (after PTQ), the peak memory is about 9900 MB (screenshot attached), and the inference speed is much slower than fp16.
Is this reasonable, or did I do something wrong?

@A-suozhang
Member

Thank you for your interest in our work. We currently offer the code for "software quantization simulation." For actual hardware resource savings, it is essential to employ the INT CUDA kernel. We are actively engaged in developing this CUDA kernel implementation.
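For context, here is a minimal sketch of what such a "fake quantization" simulation typically looks like; the `fake_quantize` helper and the per-tensor symmetric scheme are illustrative assumptions, not the exact code of this repo. Because each tensor is quantized and immediately dequantized back to FP16, memory use and kernel speed match the FP16 baseline:

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Quantize then immediately dequantize; the output stays FP16,
    so only the numerical values change, not memory or kernel speed."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax() / qmax                        # per-tensor symmetric scale
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return (x_int * scale).to(x.dtype)                   # back to the original FP16 dtype

x = torch.randn(4, 4, dtype=torch.float16)
print(fake_quantize(x, n_bits=8).dtype)                  # torch.float16 -- no savings
```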

@Leo-yang-1020
Author

Thanks for your reply!

@Leo-yang-1020
Author


But I still wonder why the GPU memory didn't decrease? It was the same as fp16. According to your theory and paper, memory can be reduced by 2.4x, and from my perspective the CUDA kernel implementation only affects the inference speed, not the memory. I tried W6A6, and it shows the same peak memory.
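For reference, a minimal way to check peak allocated GPU memory around an inference call in PyTorch; the Linear layer below is a hypothetical stand-in for the actual pipeline:

```python
import torch

# Hypothetical stand-in for the real model; replace with the actual pipeline call.
model = torch.nn.Linear(4096, 4096).half().cuda()
inputs = torch.randn(64, 4096, dtype=torch.float16, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = model(inputs)
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"peak allocated memory: {peak_mb:.1f} MB")
```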

@A-suozhang
Member

In our current Python simulation code, the data format remains in FP16 to facilitate FP16 computations, resulting in a memory cost comparable to that of FP16.

The memory expense is composed of two components: "static," which includes the model weight parameters stored on the GPU, and "dynamic," referring to the activations stored during the computation of the current layer.

Without a low-bit CUDA kernel, the activations must be in FP16 for FP16 computations. While it is possible to store the model weights in a low-bit format (a feature not yet implemented in our current code), these weights would need to be upcast to FP16 for the computation process.
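To illustrate the static/dynamic split, here is a hedged sketch (a hypothetical class, not part of this repo) of keeping a weight in int8 on the GPU and upcasting it to FP16 at compute time; the matmul and the activations stay FP16, so only the weight ("static") memory would shrink:

```python
import torch

class Int8StoredLinear(torch.nn.Module):
    """Hypothetical layer: the weight lives on the GPU as int8 (saving 'static'
    memory), but is upcast to FP16 at compute time, so the matmul and the
    activations ('dynamic' memory) remain FP16."""
    def __init__(self, weight_fp16: torch.Tensor):
        super().__init__()
        scale = weight_fp16.abs().amax() / 127
        w_int8 = torch.clamp(torch.round(weight_fp16 / scale), -128, 127).to(torch.int8)
        self.register_buffer("scale", scale)
        self.register_buffer("w_int8", w_int8)

    def forward(self, x_fp16: torch.Tensor) -> torch.Tensor:
        w = self.w_int8.to(torch.float16) * self.scale   # upcast before the FP16 matmul
        return x_fp16 @ w.t()                            # activations remain FP16

# Example with hypothetical shapes: only the stored weight is int8, the output is FP16.
layer = Int8StoredLinear(torch.randn(1024, 1024, dtype=torch.float16, device="cuda"))
y = layer(torch.randn(8, 1024, dtype=torch.float16, device="cuda"))
```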

@Leo-yang-1020
Author

Thanks for your reply! Hope everything goes well with the new feature.

@xxw11

xxw11 commented Oct 17, 2024

Hello, I would like to ask how to reproduce the memory and latency data provided in the paper? @A-suozhang

@xxw11

xxw11 commented Oct 17, 2024

If the values in the paper are obtained through estimation, could you provide the estimation method?

@A-suozhang
Member

You may need customized CUDA kernels for actual speedup and memory savings. We are still cleaning up the CUDA kernel code and will release it soon. Please stay tuned for the update.
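Until the kernels are released, the static (weight) portion of the memory can at least be estimated from the parameter count and bit-width; the sketch below is a back-of-the-envelope calculation under an assumed parameter count, not the paper's exact estimation method:

```python
def weight_memory_mb(num_params: int, bits: int) -> float:
    """Theoretical weight storage only; ignores activations, quantization
    scales/zero-points, and allocator overhead."""
    return num_params * bits / 8 / (1024 ** 2)

num_params = 600_000_000  # hypothetical parameter count, not a figure from the paper
for bits in (16, 8, 6, 4):
    print(f"W{bits}: {weight_memory_mb(num_params, bits):.0f} MB")
```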
