
Request for Head-Specific KV Cache Compression Feature #7

Open
FFY0 opened this issue Nov 21, 2024 · 2 comments
Labels: feature request (New feature or request), good first issue (Good for newcomers)

Comments

FFY0 commented Nov 21, 2024

🚀 Feature

Add support for head-specific KV cache compression, which applies a different compression rate to each attention head.

Motivation

Ada-KV [1] has demonstrated that applying different compression rates across attention heads can significantly improve cache compression quality. Recently, several head-specific approaches have emerged, such as DuoAttention [2], RazorAttention [3], and HeadKV [4], each introducing its own technique for allocating the cache budget per head. However, all of these methods require handling variable-length cache entries across heads, which KVPress currently does not support. We believe supporting this would significantly increase the flexibility of KVPress and align it with emerging head-specific compression strategies.

[1] Feng, Y., Lv, J., Cao, Y., Xie, X., & Zhou, S. K. (2024). Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550.
[2] Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., ... & Han, S. (2024). DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv preprint arXiv:2410.10819.
[3] Tang, H., Lin, Y., Lin, J., Han, Q., Hong, S., Yao, Y., & Wang, G. (2024). RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. arXiv preprint arXiv:2407.15891.
[4] Fu, Y., Cai, Z., Asi, A., Xiong, W., Dong, Y., & Xiao, W. (2024). Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning. arXiv preprint arXiv:2410.19258.
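
For a concrete picture of what this could look like, here is a rough sketch of adaptive per-head budget allocation in the spirit of Ada-KV [1]. This is not the KVPress API; the function names, the score tensor, and the ragged list-of-tensors output are purely illustrative assumptions.

```python
# Illustrative sketch only (not KVPress code): allocate a shared KV budget across
# heads in proportion to per-head importance scores, then keep a different number
# of entries per head. The result is a "ragged" cache: one tensor per head with a
# head-specific length.
import torch


def allocate_head_budgets(scores: torch.Tensor, total_budget: int) -> torch.Tensor:
    """Split total_budget across heads proportionally to their summed scores.

    scores: (num_heads, seq_len) importance of each cached token for each head.
    Returns integer per-head budgets summing to total_budget.
    """
    head_mass = scores.sum(dim=-1)
    raw = head_mass / head_mass.sum() * total_budget
    budgets = raw.floor().to(torch.long)
    # Hand out the remaining slots to the heads with the largest fractional parts.
    remainder = int(total_budget - budgets.sum())
    if remainder > 0:
        extra = torch.topk(raw - budgets, remainder).indices
        budgets[extra] += 1
    return budgets


def compress_per_head(keys, values, scores, total_budget):
    """keys/values: (num_heads, seq_len, head_dim). Returns per-head lists of kept entries."""
    seq_len = scores.shape[-1]
    budgets = allocate_head_budgets(scores, total_budget)
    kept_keys, kept_values = [], []
    for h, budget in enumerate(budgets.tolist()):
        budget = min(budget, seq_len)  # guard against a head being allotted more than it has
        idx = torch.topk(scores[h], budget).indices.sort().values  # keep original token order
        kept_keys.append(keys[h, idx])
        kept_values.append(values[h, idx])
    return kept_keys, kept_values


if __name__ == "__main__":
    num_heads, seq_len, head_dim = 4, 16, 8
    keys = torch.randn(num_heads, seq_len, head_dim)
    values = torch.randn(num_heads, seq_len, head_dim)
    scores = torch.rand(num_heads, seq_len)  # e.g. pooled attention weights
    k, v = compress_per_head(keys, values, scores, total_budget=24)
    print([t.shape[0] for t in k])  # variable number of kept entries per head
```

The awkward part, and the reason for this request, is the last step: the kept entries no longer form a rectangular (num_heads, budget, head_dim) tensor, so they cannot be dropped into the existing cache and attention path as-is.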

SimJeg added the good first issue label on Nov 21, 2024
SimJeg (Collaborator) commented Nov 21, 2024

Hi @FFY0,

Definitely a good issue, that's a key feature for several compression techniques. However, it requires implementing a new kernel to be efficient, so it's a significant effort (unless we find a trick... I do have some ideas ^^)
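
To illustrate why a new kernel (or layout) is needed, and definitely not the trick hinted at above: with off-the-shelf attention kernels, the straightforward fallback for a ragged per-head cache is to pad it back to a rectangular tensor plus a mask, which gives up most of the memory savings. A hypothetical sketch, assuming the list-of-tensors layout from the example above:

```python
# Sketch only: the naive fallback for a ragged per-head cache is to re-pad it so a
# standard (rectangular) attention kernel can consume it, masking the padded slots.
# This mostly defeats the purpose, which is why an efficient head-specific path
# likely needs a dedicated kernel or an unpadded cache layout.
import torch


def pad_ragged_cache(per_head_keys):
    """per_head_keys: list of (len_h, head_dim) tensors, one per head."""
    num_heads = len(per_head_keys)
    head_dim = per_head_keys[0].shape[-1]
    max_len = max(k.shape[0] for k in per_head_keys)
    padded = per_head_keys[0].new_zeros(num_heads, max_len, head_dim)
    mask = torch.zeros(num_heads, max_len, dtype=torch.bool)
    for h, k in enumerate(per_head_keys):
        padded[h, : k.shape[0]] = k
        mask[h, : k.shape[0]] = True  # True marks real (non-padding) entries
    return padded, mask
```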

FFY0 (Author) commented Nov 21, 2024

Thanks, @SimJeg!
Looking forward to the head-specific KV cache compression feature. This will really help drive progress in head-wise adaptive compression! 🚀

SimJeg added the feature request label on Nov 26, 2024