diff --git a/Lectures/W15-KVcahe-WMDP-Tools.pdf b/Lectures/W15-KVcahe-WMDP-Tools.pdf
new file mode 100644
index 0000000..477065e
Binary files /dev/null and b/Lectures/W15-KVcahe-WMDP-Tools.pdf differ
diff --git a/_contents/S0-L27.md b/_contents/S0-L27.md
index 1b2cce6..3273759 100755
--- a/_contents/S0-L27.md
+++ b/_contents/S0-L27.md
@@ -1,7 +1,7 @@
 ---
 layout: post
 title: Bonus session on KV Cache, Tooling and WMDP
-lecture: 
+lecture: W15-KVcahe-WMDP-Tools
 lectureVersion: current
 extraContent: 
 tags:
@@ -17,23 +17,11 @@ categories:
 
 ### KV Caching in LLM: 
 
-+ Retentive Network: A Successor to Transformer for Large Language Models: https://arxiv.org/abs/2307.08621
-
-+ https://arxiv.org/abs/2305.13048 RWKV: Reinventing RNNs for the Transformer Era
-
 + grouped query attention: https://arxiv.org/pdf/2305.13245.pdf
 
 + Paged attention https://arxiv.org/pdf/2309.06180.pdf https://openreview.net/pdf?id=uNrFpDPMyo
 
-### Retentive Network: A Successor to Transformer for Large Language Models
-+ In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation… Show more
-
-
-### RWKV: Reinventing RNNs for the Transformer Era
-+ Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
-Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transfor… Show more
-
 
 ### The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
 + Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks
 
@@ -72,7 +60,7 @@ Our approach leverages a linear attention mechanism and allows us to formulate t
 
 
 
-
+## More readings
 
 ### Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
 + Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, Xia Hu
@@ -81,6 +69,14 @@ Our approach leverages a linear attention mechanism and allows us to formulate t
 
 + https://github.com/Mooler0410/LLMsPracticalGuide
 
 
+### Retentive Network: A Successor to Transformer for Large Language Models
++ In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation… Show more
+
+
+### RWKV: Reinventing RNNs for the Transformer Era
++ Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
++Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transfor… Show more
+
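
For context on the "KV Caching in LLM" readings this patch rearranges, here is a minimal single-head decoding sketch in NumPy. The dimensions, random weights, and function names are illustrative assumptions, not code from any of the linked papers; it only shows why caching keys and values keeps each decoding step cheap while the cache itself grows with sequence length (the cost that grouped-query attention and paged attention manage, and that RetNet/RWKV-style recurrent states replace with a constant-size state).

```python
# A toy, single-head KV-cache decoding sketch (NumPy only).
# All shapes, weights, and names below are illustrative assumptions.
import numpy as np

d_model = 8                                   # toy embedding width
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention of one query against all cached keys/values."""
    scores = q @ K.T / np.sqrt(d_model)       # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                        # (d_model,)

def decode(token_embs):
    """Decode one position at a time, appending each step's K/V to the cache.

    Each step projects only the newest token instead of re-projecting the whole
    prefix, but the cache still grows linearly with sequence length.
    """
    K_cache = np.empty((0, d_model))
    V_cache = np.empty((0, d_model))
    outputs = []
    for x in token_embs:                      # x: (d_model,) embedding of one token
        q = x @ W_q
        K_cache = np.vstack([K_cache, (x @ W_k)[None, :]])
        V_cache = np.vstack([V_cache, (x @ W_v)[None, :]])
        outputs.append(attend(q, K_cache, V_cache))
    return np.stack(outputs)

print(decode(rng.standard_normal((5, d_model))).shape)   # -> (5, 8)
```

Running the sketch prints `(5, 8)`: five decoded positions, each an 8-dimensional attention output.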