
Why does W8A8 run much slower and take more GPU memory than fp16? #3

Open
Leo-yang-1020 opened this issue Jul 19, 2024 · 8 comments

@Leo-yang-1020

When trying to reproduce your code, we find that when running inference with the default fp16, the peak memory is about 9800 MB (screenshot attached).
But when running inference with W8A8 (after PTQ), the peak memory is about 9900 MB (screenshot attached), and the inference speed is much slower than fp16.
Is this reasonable, or did I do something wrong?

@A-suozhang
Member

Thank you for your interest in our work. We currently offer the code for "software quantization simulation." For actual hardware resource savings, it is essential to employ the INT CUDA kernel. We are actively engaged in developing this CUDA kernel implementation.
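For context, here is a minimal sketch of what such a "fake quantization" simulation typically looks like; the `fake_quantize` helper and the per-tensor symmetric scheme are illustrative assumptions, not the exact code of this repo. Because each tensor is quantized and immediately dequantized back to FP16, memory use and kernel speed match the FP16 baseline:

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Quantize then immediately dequantize; the output stays FP16,
    so only the numerical values change, not memory or kernel speed."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax() / qmax                        # per-tensor symmetric scale
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return (x_int * scale).to(x.dtype)                   # back to the original FP16 dtype

x = torch.randn(4, 4, dtype=torch.float16)
print(fake_quantize(x, n_bits=8).dtype)                  # torch.float16 -- no savings
```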

@Leo-yang-1020
Author

Thanks for your reply!

@Leo-yang-1020
Author


But I still wonder why the GPU memory didn't decrease? It was the same as fp16. According to your theory and paper, memory can be reduced by 2.4x, and from my perspective the CUDA kernel implementation only affects the inference speed, not the memory. I tried W6A6, and it shows the same peak memory.
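For reference, a minimal way to check peak allocated GPU memory around an inference call in PyTorch; the Linear layer below is a hypothetical stand-in for the actual pipeline:

```python
import torch

# Hypothetical stand-in for the real model; replace with the actual pipeline call.
model = torch.nn.Linear(4096, 4096).half().cuda()
inputs = torch.randn(64, 4096, dtype=torch.float16, device="cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    _ = model(inputs)
peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"peak allocated memory: {peak_mb:.1f} MB")
```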

@A-suozhang
Member

In our current Python simulation code, the data format remains in FP16 to facilitate FP16 computations, resulting in a memory cost comparable to that of FP16.

The memory expense is composed of two components: "static," which includes the model weight parameters stored on the GPU, and "dynamic," referring to the activations stored during the computation of the current layer.

Without a low-bit CUDA kernel, the activations must be in FP16 for FP16 computations. While it is possible to store the model weights in a low-bit format (a feature not yet implemented in our current code), these weights would need to be upcast to FP16 for the computation process.
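To illustrate the static/dynamic split, here is a hedged sketch (a hypothetical class, not part of this repo) of keeping a weight in int8 on the GPU and upcasting it to FP16 at compute time; the matmul and the activations stay FP16, so only the weight ("static") memory would shrink:

```python
import torch

class Int8StoredLinear(torch.nn.Module):
    """Hypothetical layer: the weight lives on the GPU as int8 (saving 'static'
    memory), but is upcast to FP16 at compute time, so the matmul and the
    activations ('dynamic' memory) remain FP16."""
    def __init__(self, weight_fp16: torch.Tensor):
        super().__init__()
        scale = weight_fp16.abs().amax() / 127
        w_int8 = torch.clamp(torch.round(weight_fp16 / scale), -128, 127).to(torch.int8)
        self.register_buffer("scale", scale)
        self.register_buffer("w_int8", w_int8)

    def forward(self, x_fp16: torch.Tensor) -> torch.Tensor:
        w = self.w_int8.to(torch.float16) * self.scale   # upcast before the FP16 matmul
        return x_fp16 @ w.t()                            # activations remain FP16

# Example with hypothetical shapes: only the stored weight is int8, the output is FP16.
layer = Int8StoredLinear(torch.randn(1024, 1024, dtype=torch.float16, device="cuda"))
y = layer(torch.randn(8, 1024, dtype=torch.float16, device="cuda"))
```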

@Leo-yang-1020
Author

Thanks for your reply! Hope everything goes well with the new feature.

@xxw11

xxw11 commented Oct 17, 2024

Hello, I would like to ask how to reproduce the memory and latency data provided in the paper? @A-suozhang

@xxw11

xxw11 commented Oct 17, 2024

If the values in the paper are obtained through estimation, could you provide the estimation method?

@A-suozhang
Member

You may need customized CUDA kernels for actual speedup and memory savings. We are still cleaning up the CUDA kernel code and will release it soon. Please stay tuned for the update.
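Until the kernels are released, the static (weight) portion of the memory can at least be estimated from the parameter count and bit-width; the sketch below is a back-of-the-envelope calculation under an assumed parameter count, not the paper's exact estimation method:

```python
def weight_memory_mb(num_params: int, bits: int) -> float:
    """Theoretical weight storage only; ignores activations, quantization
    scales/zero-points, and allocator overhead."""
    return num_params * bits / 8 / (1024 ** 2)

num_params = 600_000_000  # hypothetical parameter count, not a figure from the paper
for bits in (16, 8, 6, 4):
    print(f"W{bits}: {weight_memory_mb(num_params, bits):.0f} MB")
```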
