🚀 Feature
Adding support for head-specific KV cache compression, which applies a different compression rate to each attention head.
Motivation
Ada-KV [1] has demonstrated that allocating different compression rates across attention heads can significantly improve cache compression methods. Recently, numerous head-specific approaches such as DuoAttention [2], RazorAttention [3], and HeadKV [4] have emerged, each introducing its own technique for improving compression quality at the level of individual heads. However, these methods require handling variable-length cache entries across heads, which KVPress currently does not support. We believe supporting this feature would significantly enhance the flexibility of KVPress and align it with emerging head-specific compression strategies (a rough sketch of the idea follows the references below).
[1] Feng, Y., Lv, J., Cao, Y., Xie, X., & Zhou, S. K. (2024). Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. arXiv preprint arXiv:2407.11550.
[2] Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., ... & Han, S. (2024). DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. arXiv preprint arXiv:2410.10819.
[3] Tang, H., Lin, Y., Lin, J., Han, Q., Hong, S., Yao, Y., & Wang, G. (2024). RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. arXiv preprint arXiv:2407.15891.
[4] Fu, Y., Cai, Z., Asi, A., Xiong, W., Dong, Y., & Xiao, W. (2024). Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning. arXiv preprint arXiv:2410.19258.
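To make the request concrete, here is a minimal, hypothetical sketch of Ada-KV-style head-specific compression: a global budget is split across heads in proportion to their importance scores, then each head keeps a different number of entries. The function names, the `scores` tensor (e.g. attention weights accumulated over an observation window), and the ragged list-of-tensors output are illustrative assumptions, not KVPress APIs.

```python
import torch


def allocate_head_budgets(scores: torch.Tensor, total_budget: int) -> torch.Tensor:
    """Split a global KV budget across heads proportionally to their score mass.

    scores: (num_heads, seq_len) importance score per cached token and head,
            e.g. attention weights accumulated over an observation window.
    Returns a (num_heads,) long tensor of per-head budgets.
    """
    head_mass = scores.sum(dim=-1)                                   # (num_heads,)
    budgets = head_mass / head_mass.sum() * total_budget
    return budgets.floor().long().clamp(1, scores.shape[-1])


def compress_per_head(keys, values, scores, total_budget):
    """Keep a different number of KV entries in each head.

    keys / values: (num_heads, seq_len, head_dim)
    Returns ragged lists with one (budget_h, head_dim) tensor per head,
    since the retained lengths now differ across heads.
    """
    budgets = allocate_head_budgets(scores, total_budget)
    kept_keys, kept_values = [], []
    for h in range(keys.shape[0]):
        idx = scores[h].topk(int(budgets[h])).indices.sort().values  # keep original token order
        kept_keys.append(keys[h, idx])
        kept_values.append(values[h, idx])
    return kept_keys, kept_values
```

The ragged output is exactly what the current fixed (num_heads, seq_len, head_dim) cache layout cannot represent, which is why head-specific compression is a structural change to the cache rather than just another scoring rule.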
Definitely a good issue; that's a key feature for several compression techniques. However, it requires implementing a new kernel to be efficient, so it's a significant effort (unless we find a trick... I do have some ideas ^^)
Thanks, @SimJeg!
Looking forward to the head-specific KV cache compression feature. This will effectively drive progress in the field of head-wise adaptive compression! 🚀
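Regarding the kernel question above, one candidate trick, sketched here under heavy assumptions rather than as a definitive implementation, is to reuse an existing variable-length attention kernel (flash-attn's `flash_attn_varlen_func`) by treating each head as its own packed "sequence", so no brand-new kernel would be strictly required. The helper below and its tensor layout are illustrative only, not KVPress code, and assume fp16/bf16 CUDA tensors with the flash-attn package installed.

```python
# Illustrative only: not KVPress code; assumes the flash-attn package is installed.
import torch
from flash_attn import flash_attn_varlen_func


def ragged_head_attention(q, kept_keys, kept_values, softmax_scale=None):
    """Single decode step over a head-specific (ragged) KV cache for one sample.

    q: (num_heads, head_dim) query of the new token.
    kept_keys / kept_values: lists of (len_h, head_dim) tensors, len_h varies per head.
    """
    num_heads, head_dim = q.shape
    device = q.device

    # Each head becomes its own varlen "sequence" of length 1 on the query side.
    q_packed = q.unsqueeze(1).to(torch.float16)                      # (num_heads, 1, head_dim)
    cu_q = torch.arange(num_heads + 1, dtype=torch.int32, device=device)

    # Pack the ragged KV cache along the token dimension.
    lens = torch.tensor([k.shape[0] for k in kept_keys], dtype=torch.int32, device=device)
    cu_k = torch.zeros(num_heads + 1, dtype=torch.int32, device=device)
    cu_k[1:] = torch.cumsum(lens, dim=0)
    k_packed = torch.cat(kept_keys, dim=0).unsqueeze(1).to(torch.float16)   # (total_k, 1, head_dim)
    v_packed = torch.cat(kept_values, dim=0).unsqueeze(1).to(torch.float16)

    out = flash_attn_varlen_func(
        q_packed, k_packed, v_packed,
        cu_seqlens_q=cu_q, cu_seqlens_k=cu_k,
        max_seqlen_q=1, max_seqlen_k=int(lens.max()),
        softmax_scale=softmax_scale, causal=False,
    )
    return out.squeeze(1)                                            # (num_heads, head_dim)
```

Whether the per-head packing overhead makes this fast enough in practice is exactly the open question raised in the comment above.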