adding L27 slide deck for bonus session
qiyanjun committed May 1, 2024
1 parent 174b707 commit 8e85faf
Showing 2 changed files with 10 additions and 14 deletions.
Binary file added Lectures/W15-KVcahe-WMDP-Tools.pdf
Binary file not shown.
24 changes: 10 additions & 14 deletions _contents/S0-L27.md
@@ -1,7 +1,7 @@
---
layout: post
title: Bonus session on KV Cache, Tooling and WMDP
lecture:
lecture: W15-KVcahe-WMDP-Tools
lectureVersion: current
extraContent:
tags:
@@ -17,23 +17,11 @@ categories:

### KV Caching in LLMs:

+ Retentive Network: A Successor to Transformer for Large Language Models: https://arxiv.org/abs/2307.08621
+ RWKV: Reinventing RNNs for the Transformer Era: https://arxiv.org/abs/2305.13048
+ Grouped-query attention: https://arxiv.org/pdf/2305.13245.pdf
+ PagedAttention: https://arxiv.org/pdf/2309.06180.pdf, https://openreview.net/pdf?id=uNrFpDPMyo

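To make the readings above concrete, here is a minimal NumPy sketch (not taken from any of these papers; class and variable names are illustrative) of what the KV cache in a decoder-only model stores: each decoding step appends one key/value row, so step t attends over t cached rows instead of recomputing keys and values for the whole prefix.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SingleHeadDecoder:
    """Toy single-head attention layer with a KV cache (illustrative only)."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.Wk = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.Wv = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
        self.k_cache = []  # one (d_model,) key per already-processed position
        self.v_cache = []  # one (d_model,) value per already-processed position

    def step(self, x):
        """Decode one token: x is the (d_model,) embedding of the newest position."""
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        self.k_cache.append(k)  # append instead of recomputing K, V for the whole prefix
        self.v_cache.append(v)
        K, V = np.stack(self.k_cache), np.stack(self.v_cache)  # (t, d_model) each
        attn = softmax(K @ q / np.sqrt(len(q)))                 # causal scores over the cache
        return attn @ V                                         # context vector for the new token

decoder = SingleHeadDecoder(d_model=8)
rng = np.random.default_rng(1)
for t in range(5):            # five decoding steps, each reusing all previously cached K/V
    out = decoder.step(rng.normal(size=8))
print(len(decoder.k_cache), out.shape)  # -> 5 (8,)
```

Grouped-query attention and PagedAttention in the list above both target this cache: the former shrinks it by sharing key/value heads across query heads, the latter manages its memory in fixed-size, non-contiguous blocks to reduce fragmentation during serving.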

### Retentive Network: A Successor to Transformer for Large Language Models
+ In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation…
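
As a rough illustration of the duality described in the abstract, the sketch below (a simplification, not the paper's full method: scalar decay only, no multi-scale heads, xPos-style rotation, or normalization) shows why the parallel form used for training and the recurrent form used for $O(1)$-per-step inference compute the same outputs.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel (training) form: O = (Q K^T ⊙ D) V with decay mask D[n, m] = gamma**(n-m) for m <= n."""
    T = Q.shape[0]
    n, m = np.arange(T)[:, None], np.arange(T)[None, :]
    D = np.where(n >= m, float(gamma) ** (n - m), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent (inference) form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n, with a constant-size state."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    outs = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)  # decay the state, then add the new key/value outer product
        outs.append(q @ S)
    return np.stack(outs)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
print(np.allclose(retention_parallel(Q, K, V, 0.9),
                  retention_recurrent(Q, K, V, 0.9)))  # True: same outputs, different cost profile
```

The chunkwise recurrent form mentioned at the end of the abstract mixes the two: the parallel formula is applied within each chunk while a recurrent state of this kind carries information across chunk boundaries.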


### RWKV: Reinventing RNNs for the Transformer Era
+ Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transfor…
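
The Transformer-or-RNN formulation above rests on a linear-attention identity. The sketch below is a generic kernelized-attention toy, not RWKV's actual time-mixing (which adds learned per-channel decay, a bonus for the current token, and token shift); it only shows how the same causal computation can run in parallel over a whole sequence during training or as an RNN with a fixed-size state at inference.

```python
import numpy as np

def phi(x):
    """A simple positive feature map (stand-in for the kernel used by linear-attention models)."""
    return np.maximum(x, 0.0) + 1e-6

def linear_attn_parallel(Q, K, V):
    """Training-style form: causal kernelized attention over the full sequence (quadratic time)."""
    T = Q.shape[0]
    scores = (phi(Q) @ phi(K).T) * np.tril(np.ones((T, T)))  # zero out future positions
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attn_rnn(Q, K, V):
    """Inference-style form: one pass with a constant-size state (S, z), O(1) memory per step."""
    S = np.zeros((Q.shape[1], V.shape[1]))  # running sum of phi(k_t) v_t^T
    z = np.zeros(Q.shape[1])                # running sum of phi(k_t), used for normalization
    outs = []
    for q, k, v in zip(phi(Q), phi(K), V):
        S += np.outer(k, v)
        z += k
        outs.append((q @ S) / (q @ z))
    return np.stack(outs)

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 3)) for _ in range(3))
print(np.allclose(linear_attn_parallel(Q, K, V), linear_attn_rnn(Q, K, V)))  # True
```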


### The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
+ Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks
@@ -72,7 +60,7 @@ Our approach leverages a linear attention mechanism and allows us to formulate t




## More readings

### Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
+ Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, Xia Hu
@@ -81,6 +69,14 @@ Our approach leverages a linear attention mechanism and allows us to formulate t
+ https://github.com/Mooler0410/LLMsPracticalGuide


### Retentive Network: A Successor to Transformer for Large Language Models
+ In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation…


### RWKV: Reinventing RNNs for the Transformer Era
+ Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transfor…


<!--excerpt.start-->

