# awesome-long-context

## Efficient Inference, Sparse Attention, Efficient KV Cache

- [2020/01] Reformer: The Efficient Transformer
- [2020/06] Linformer: Self-Attention with Linear Complexity
- [2022/12] Parallel Context Windows for Large Language Models
- [2023/04] Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering
- [2023/05] Landmark Attention: Random-Access Infinite Context Length for Transformers
- [2023/05] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
- [2023/06] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- [2023/06] H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- [2023/07] Scaling In-Context Demonstrations with Structured Attention
- [2023/08] LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
- [2023/09] Efficient Streaming Language Models with Attention Sinks
- [2023/10] HyperAttention: Long-context Attention in Near-Linear Time
- [2023/10] TRAMS: Training-free Memory Selection for Long-range Language Modeling

## External Memory & Information Retrieval

- [2023/06] Augmenting Language Models with Long-Term Memory
- [2023/06] Long-range Language Modeling with Self-retrieval
- [2023/07] Focused Transformer: Contrastive Training for Context Scaling

## Positional Encoding

- [2021/04] RoFormer: Enhanced Transformer with Rotary Position Embedding
- [2022/03] Transformer Language Models without Positional Encodings Still Learn Positional Information
- [2022/04] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
- [2022/05] KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation
- [2022/12] Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis
- [2022/12] The Impact of Positional Encoding on Length Generalization in Transformers
- [2023/05] Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings
- [2023/06] Extending Context Window of Large Language Models via Positional Interpolation
- [2023/07] Exploring Transformer Extrapolation
- [2023/09] YaRN: Efficient Context Window Extension of Large Language Models
- [2023/09] Effective Long-Context Scaling of Foundation Models
- [2023/10] CLEX: Continuous Length Extrapolation for Large Language Models
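Several entries above (RoFormer, Positional Interpolation, YaRN, CLEX) revolve around rotary position embeddings and how to stretch them beyond the trained context length. As a quick orientation, here is a minimal NumPy sketch of RoPE with linear positional interpolation; the function names and the 4k-to-16k numbers are illustrative only and are not taken from any of the listed papers' code.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Per-position rotation angles for RoPE. A scale < 1 implements linear
    positional interpolation: positions are squeezed into the trained range."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions * scale, inv_freq)       # (seq_len, dim/2)

def apply_rope(x, positions, base=10000.0, scale=1.0):
    """Rotate consecutive feature pairs of x (seq_len, dim) by their angles."""
    angles = rope_angles(positions, x.shape[-1], base, scale)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Illustrative extension of a model trained on 4k positions to a 16k input:
# scale positions by 4096 / 16384 before applying the rotation.
queries = np.random.randn(16384, 64).astype(np.float32)
queries_rotated = apply_rope(queries, np.arange(16384), scale=4096 / 16384)
```

Follow-ups such as YaRN and CLEX refine this single uniform scale factor, e.g. by interpolating different frequency bands by different amounts rather than rescaling all positions equally.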
## Context Compression

- [2022/12] Structured Prompting: Scaling In-Context Learning to 1,000 Examples
- [2023/05] Efficient Prompting via Dynamic In-Context Learning
- [2023/05] Adapting Language Models to Compress Contexts
- [2023/05] Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
- [2023/07] In-context Autoencoder for Context Compression in a Large Language Model
- [2023/10] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- [2023/10] RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation
- [2023/10] Compressing Context to Enhance Inference Efficiency of Large Language Models
- [2023/10] LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
- [2023/10] LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
- [2023/10] TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction

## Architecture Variants

- [2021/11] Efficiently Modeling Long Sequences with Structured State Spaces
- [2022/12] Hungry Hungry Hippos: Towards Language Modeling with State Space Models
- [2023/02] Hyena Hierarchy: Towards Larger Convolutional Language Models
- [2023/04] Scaling Transformer to 1M tokens and beyond with RMT
- [2023/06] Block-State Transformer
- [2023/07] Retentive Network: A Successor to Transformer for Large Language Models
- [2023/10] Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors
- [2023/10] Mamba: Linear-Time Sequence Modeling with Selective State Spaces

## White-Box

- [2019/06] Theoretical Limitations of Self-Attention in Neural Sequence Models
- [2020/06] $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
- [2022/02] Overcoming a Theoretical Limitation of Self-Attention
- [2023/05] Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
- [2023/10] JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention

## Long Context Modeling

- [2023/07] LongNet: Scaling Transformers to 1,000,000,000 Tokens
- [2023/08] Giraffe: Adventures in Expanding Context Lengths in LLMs
- [2023/09] LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
- [2023/10] Mistral 7B

## Benchmarks

- [2020/11] Long Range Arena: A Benchmark for Efficient Transformers
- [2022/01] SCROLLS: Standardized CompaRison Over Long Language Sequences
- [2023/01] LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization
- [2023/05] ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding
- [2023/08] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
- [2023/10] M4LE: A Multi-Ability Multi-Range Multitask Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

## Data

- [2023/12] Structured Packing in LLM Training Improves Long Context Utilization
- [2024/01] LongAlign: A Recipe for Long Context Alignment of Large Language Models
- [2024/02] Data Engineering for Scaling Language Models to 128K Context

## Others

- [2023/07] Zero-th Order Algorithm for Softmax Attention Optimization
- [2023/10] (Dynamic) Prompting might be all you need to repair Compressed LLMs
- [2023/10] Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors