Paper Link: https://arxiv.org/abs/2409.01990
Article Title | Year | Subfield | Link |
---|---|---|---|
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | 2024 | Quantization | Link |
GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers | 2022 | Quantization | Link |
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | 2022 | Quantization | Link |
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | 2023 | Quantization | Link |
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation | 2024 | Quantization | Link |
OneBit: Towards Extremely Low-bit Large Language Models | 2024 | Quantization | Link |
QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | Quantization | Link |
Memory-Efficient Fine-Tuning of Compressed Large Language Models via Sub-4-Bit Integer Quantization | 2023 | Quantization | Link |
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | 2023 | Quantization | Link |
L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-Wise LSQ | 2024 | Quantization | Link |
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | 2023 | Quantization | Link |
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models | 2024 | Quantization | Link |
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention | 2024 | Quantization | Link |
EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge | 2024 | Quantization | Link |
LUT-GEMM: Quantized Matrix Multiplication Based on LUTs for Efficient Inference in Large-Scale Generative Language Models | 2022 | Quantization | Link |
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | 2023 | Quantization | Link |
OliVe: Accelerating Large Language Models via Hardware-Friendly Outlier-Victim Pair Quantization | 2023 | Quantization | Link |
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | 2024 | Quantization | Link |
ZeroQuant-V2: Exploring Post-Training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | 2023 | Quantization | Link |
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | 2023 | Quantization | Link |
SpinQuant: LLM Quantization with Learned Rotations | 2024 | Quantization | Link |
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving | 2024 | Quantization | Link |
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | 2023 | Quantization | Link |
Compression of Generative Pre-Trained Language Models via Quantization | 2022 | Quantization | Link |
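
Most of the post-training methods above improve on plain round-to-nearest (RTN) weight quantization. For orientation, a minimal sketch of symmetric per-channel RTN in PyTorch; the 4-bit setting and function names are illustrative and not taken from any specific paper:

```python
import torch

def quantize_rtn_per_channel(w: torch.Tensor, bits: int = 4):
    """Symmetric round-to-nearest quantization with one scale per output channel.

    w: weight matrix of shape (out_features, in_features).
    Returns the integer codes and the per-channel scales needed to dequantize.
    """
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit signed
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per row
    scale = scale.clamp(min=1e-8)                     # avoid division by zero
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(8, 16)
    q, s = quantize_rtn_per_channel(w, bits=4)
    print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```
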
Article Title | Year | Subfield | Link |
---|---|---|---|
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes | 2024 | Knowledge Distillation | Link |
Knowledge Distillation for Closed-Source Language Models | 2024 | Knowledge Distillation | Link |
Flamingo: A Visual Language Model for Few-Shot Learning | 2022 | Knowledge Distillation | Link |
In-Context Learning Distillation: Transferring Few-Shot Learning Ability of Pre-Trained Language Models | 2022 | Knowledge Distillation | Link |
Less Is More: Task-Aware Layer-Wise Distillation for Language Model Compression | 2022 | Knowledge Distillation | Link |
MiniLLM: Knowledge Distillation of Large Language Models | 2024 | Knowledge Distillation | Link |
Baby Llama: Knowledge Distillation from an Ensemble of Teachers Trained on a Small Dataset with No Performance Penalty | 2023 | Knowledge Distillation | Link |
Propagating Knowledge Updates to LMs Through Distillation | 2023 | Knowledge Distillation | Link |
MetaICL: Learning to Learn in Context | 2022 | Knowledge Distillation | Link |
Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation | 2024 | Knowledge Distillation | Link |
Layer-wise Knowledge Distillation for BERT Fine-Tuning | 2022 | Knowledge Distillation | Link |
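
For reference, the logit-matching objective that many of these works build on: a temperature-scaled KL divergence between teacher and student distributions, blended with the usual cross-entropy loss. A minimal sketch; the temperature and mixing weight are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Soft-label distillation: KL(teacher || student) at temperature T,
    blended with the ordinary cross-entropy on the hard labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

if __name__ == "__main__":
    s = torch.randn(4, 100)          # student logits
    t = torch.randn(4, 100)          # teacher logits
    y = torch.randint(0, 100, (4,))  # gold labels
    print(distillation_loss(s, t, y).item())
```
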
Article Title | Year | Subfield | Link |
---|---|---|---|
SparseGPT: Massive language models can be accurately pruned in one-shot | 2023 | Unstructured; Saliency | Link |
A Simple and Effective Pruning Approach for Large Language Models | 2023 | Unstructured; Saliency | Link |
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models | 2024 | Unstructured; Saliency | Link |
Pruning as a Domain-specific LLM Extractor | 2024 | Unstructured; Saliency | Link |
Plug-and-Play: An Efficient Post-Training Pruning Method for Large Language Models | 2024 | Unstructured; Saliency | Link |
Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models | 2024 | Unstructured; Saliency | Link |
Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models | 2023 | Unstructured; Saliency | Link |
Language-Specific Pruning for Efficient Reduction of Large Language Models | 2024 | Unstructured; Saliency | Link |
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | 2024 | Unstructured; Optimization | Link |
LLM-Pruner: On the structural pruning of large language models | 2023 | Structured; Saliency | Link |
Fluctuation-Based Adaptive Structured Pruning for Large Language Models | 2024 | Structured; Saliency | Link |
Structured Pruning for Large Language Models Using Coupled Components Elimination and Minor Fine-tuning | 2024 | Structured; Saliency | Link |
Sheared LLaMA: Accelerating Language Model Pre-Training via Structured Pruning | 2023 | Structured; Optimization | Link |
NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models | 2023 | Structured | Link |
Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning | 2024 | Structured; PEFT | Link |
LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | 2024 | Structured; PEFT | Link |
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | 2023 | Infrastructure | Link |
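
Many of the one-shot, saliency-based methods above score each weight by combining its magnitude with an activation statistic and drop the lowest-scoring entries per row. A minimal sketch of that pattern; the |W|·‖X‖ score and the 50% sparsity target are illustrative, not a faithful reimplementation of any listed method:

```python
import torch

def prune_by_saliency(w: torch.Tensor, act_norm: torch.Tensor, sparsity: float = 0.5):
    """Zero out the lowest-saliency weights in each row.

    w:        (out_features, in_features) weight matrix.
    act_norm: (in_features,) per-input-channel activation norm from calibration data.
    """
    score = w.abs() * act_norm              # saliency = |W| * ||X|| (broadcast over rows)
    k = int(w.shape[1] * sparsity)          # how many weights to drop per row
    # indices of the k smallest scores in each row
    drop = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(w, dtype=torch.bool)
    mask.scatter_(1, drop, False)
    return w * mask, mask

if __name__ == "__main__":
    w = torch.randn(4, 8)
    act = torch.rand(8) + 0.1
    pruned, mask = prune_by_saliency(w, act, sparsity=0.5)
    print("kept fraction:", mask.float().mean().item())
```
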
Article Title | Year | Subfield | Link |
---|---|---|---|
PyramidKV: Dynamic KV Cache Compression Based on Pyramidal Information Funneling | 2024 | KV Cache Compression | Link |
Quest: Query-Aware Sparsity for Long-Context Transformers | 2024 | KV Cache Compression | Link |
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression | 2024 | KV Cache Compression | Link |
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration | 2024 | KV Cache Compression | Link |
KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization | 2024 | KV Cache Compression, Quantization | Link |
Layer-Condensed KV Cache for Efficient Inference of Large Language Models | 2024 | KV Cache Optimization, Memory Efficiency | Link |
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | 2024 | Dynamic KV Cache Management, Long-Text Generation | Link |
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing | 2024 | KV Cache Sharing, Memory Optimization | Link |
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | 2024 | KV Cache Compression, Depth Dimension | Link |
Unifying KV Cache Compression for Large Language Models with LeanKV | 2024 | KV Cache Compression, Unified Framework | Link |
Efficient LLM Inference with I/O-Aware Partial KV Cache Offloading | 2024 | KV Cache Offloading, I/O Optimization | Link |
LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Offloading | 2024 | KV Cache Offloading, Serving Optimization | Link |
A Method for Building Large Language Models with Predefined KV Cache Capacity | 2024 | KV Cache Management, Predefined Capacity | Link |
D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | 2024 | Dynamic Operations, KV Cache Optimization | Link |
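
A simple baseline behind several of the compression entries above is to store cached keys and values at low precision with one scale per head and token, dequantizing at attention time. A minimal 8-bit sketch; the listed papers push well below 8 bits and combine this with eviction, sharing, or offloading:

```python
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 8):
    """Per-(head, token) symmetric quantization of a cached K or V tensor.

    kv: (num_heads, seq_len, head_dim) slice of the KV cache.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = kv.abs().amax(dim=-1, keepdim=True) / qmax   # one scale per head and token
    scale = scale.clamp(min=1e-8)
    q = torch.clamp(torch.round(kv / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

if __name__ == "__main__":
    k = torch.randn(8, 128, 64)          # heads x cached tokens x head_dim
    q, s = quantize_kv(k)
    print("reconstruction error:", (k - dequantize_kv(q, s)).abs().max().item())
```
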
Article Title | Year | Subfield | Link |
---|---|---|---|
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | 2021 | Dynamic Token Sparsification | Link |
A-ViT: Adaptive Tokens for Efficient Vision Transformer | 2022 | Adaptive Tokens | Link |
Quest: Query-Aware Sparsity for Long-Context Transformers | 2024 | Query-Aware Sparsity | Link |
Token-Level Transformer Pruning | 2022 | Token Pruning | Link |
Token Merging: Your ViT but Faster | 2023 | Token Merging | Link |
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | 2024 | Dynamic Token Pruning | Link |
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache | 2024 | Token Oracle, Sparse-Quantized KV Cache | Link |
Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving | 2024 | Video Token Sparsification, Multimodal LLMs | Link |
Efficient Token Sparsification Through the Lens of Infused Knowledge | 2024 | Token Sparsification, Infused Knowledge | Link |
Tandem Transformers for Inference Efficient LLMs | 2024 | Tandem Transformers, Inference Efficiency | Link |
CHESS: Optimizing LLM Inference via Channel-Wise Activation Sparsification | 2024 | Activation Sparsification, LLM Optimization | Link |
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs | 2024 | Compression Strategy, LLM Inference | Link |
Sparse Expansion and Neuronal Disentanglement | 2024 | Sparse Expansion, Neuronal Disentanglement | Link |
Efficient LLM Inference using Dynamic Input Pruning and Activation Sparsity | 2024 | Input Pruning, Activation Sparsity | Link |
Sparsity for Efficient LLM Inference | 2024 | Sparsity Techniques, LLM Efficiency | Link |
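
The token-level methods above share a common skeleton: score each token (typically from attention weights), keep the highest-scoring ones, and run later layers on the shortened sequence. A minimal, model-agnostic sketch of that skeleton; the scoring rule and keep ratio are illustrative:

```python
import torch

def prune_tokens(hidden: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.5):
    """Drop the least-attended tokens from a sequence.

    hidden: (seq_len, dim) hidden states for one example.
    attn:   (num_heads, seq_len, seq_len) attention probabilities from the previous layer.
    """
    # Importance of token j = attention it receives, averaged over heads and query positions.
    importance = attn.mean(dim=0).mean(dim=0)               # (seq_len,)
    k = max(1, int(hidden.shape[0] * keep_ratio))
    keep = torch.topk(importance, k).indices.sort().values  # keep original token order
    return hidden[keep], keep

if __name__ == "__main__":
    h = torch.randn(16, 32)
    a = torch.softmax(torch.randn(4, 16, 16), dim=-1)
    reduced, kept = prune_tokens(h, a, keep_ratio=0.5)
    print(reduced.shape, kept.tolist())
```
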
Article Title | Year | Subfield | Link |
---|---|---|---|
HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | 2024 | Cache Eviction, Locality-Sensitive Hashing | Link |
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | 2024 | Adaptive Budget Allocation, Cache Eviction | Link |
NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | 2024 | General Framework, Cache Management | Link |
In-context KV-Cache Eviction for LLMs via Attention-Gate | 2024 | Attention-Gate, Dynamic Cache Eviction | Link |
Layer-Condensed KV Cache for Efficient Inference of Large Language Models | 2024 | Layer Compression, Memory Efficiency | Link |
D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | 2024 | Dynamic Operations, KV Cache Optimization | Link |
H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | 2023 | Heavy-Hitter Tokens, Cache Efficiency | Link |
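
A representative policy from this table (H2O is the canonical example) keeps a budget of "heavy-hitter" tokens with the largest accumulated attention plus a window of the most recent tokens, and evicts the rest. A minimal sketch over cache indices; the budget sizes are illustrative:

```python
import torch

def select_kv_to_keep(acc_attention: torch.Tensor, num_heavy: int = 64, num_recent: int = 64):
    """Return indices of cached tokens to keep under a heavy-hitter + recency policy.

    acc_attention: (seq_len,) attention mass each cached token has accumulated so far.
    """
    seq_len = acc_attention.shape[0]
    recent = torch.arange(max(0, seq_len - num_recent), seq_len)        # sliding window
    heavy = torch.topk(acc_attention, min(num_heavy, seq_len)).indices  # heavy hitters
    keep = torch.unique(torch.cat([recent, heavy]))
    return keep.sort().values

if __name__ == "__main__":
    scores = torch.rand(1024)
    keep = select_kv_to_keep(scores, num_heavy=32, num_recent=32)
    print("cache size after eviction:", keep.numel())
```
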
Article Title | Year | Subfield | Link |
---|---|---|---|
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 2022 | Memory-Efficient Attention | Link |
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression | 2024 | Memory Management | Link |
Memory use of GPT-J 6B | 2021 | Memory Management | Link |
An Evolved Universal Transformer Memory | 2024 | Neural Attention Memory Models | Link |
Efficient Memory Management for Large Language Model Serving with PagedAttention | 2023 | PagedAttention | Link |
Recurrent Memory Transformer | 2023 | Recurrent Memory, Long Context Processing | Link |
SwiftTransformer: High Performance Transformer Implementation in C++ | 2023 | Model Parallelism, FlashAttention, PagedAttention | Link |
ChronoFormer: Memory-Efficient Transformer for Time Series Analysis | 2024 | Memory Efficiency, Time Series Data | Link |
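
The serving-oriented entries here (PagedAttention in particular) manage the KV cache like virtual memory: the cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so memory is allocated on demand and freed without fragmentation. A toy allocator sketch of that idea; the class name and block size are illustrative, and this is not vLLM's API:

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence's KV cache lives in fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token; returns (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; evict or offload a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

if __name__ == "__main__":
    alloc = PagedKVAllocator(num_blocks=4, block_size=2)
    for _ in range(3):
        print(alloc.append_token("request-0"))
    alloc.free("request-0")
    print("free blocks:", alloc.free_blocks)
```
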
Article Title | Year | Subfield | Link |
---|---|---|---|
REALM: Retrieval-Augmented Language Model Pre-Training | 2020 | Retrieval-Augmented Models | Link |
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | 2020 | Retrieval-Augmented Models | Link |
BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models | 2023 | Model Efficiency, Token Compression | Link |
FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference | 2024 | Long Context Inference, Model Efficiency | Link |
REPLUG: Retrieval-Augmented Black-Box Language Models | 2023 | Black-Box Models, Retrieval Integration | Link |
KG-Retriever: Efficient Knowledge Indexing for Retrieval-Augmented Large Language Models | 2024 | Knowledge Indexing, Large Language Models | Link |
Open-RAG: Enhanced Retrieval Augmented Reasoning with Open-Source Large Language Models | 2024 | Open-Source Models, Reasoning Enhancement | Link |
ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Retrieval and Generation | 2024 | Retrieval Improvement, Generation Enhancement | Link |
RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit | 2023 | Toolkit, Model Integration | Link |
Making Retrieval-Augmented Language Models Robust to Irrelevant Context | 2023 | Model Robustness, Document Relevance | Link |
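
All of these systems follow the retrieve-then-read loop: embed the query, fetch the nearest documents, and condition generation on them. A minimal, library-agnostic sketch; `embed` and `generate` are hypothetical placeholders for whatever encoder and LLM a given system uses:

```python
import numpy as np

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, docs: list[str], k: int = 3):
    """Return the k documents whose embeddings are most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    top = np.argsort(-(d @ q))[:k]
    return [docs[i] for i in top]

def rag_answer(question: str, embed, generate, doc_vecs, docs) -> str:
    """Retrieve supporting passages and prepend them to the prompt before generating."""
    context = retrieve(embed(question), doc_vecs, docs)
    prompt = "\n\n".join(context) + "\n\nQuestion: " + question + "\nAnswer:"
    return generate(prompt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = ["Doc about quantization.", "Doc about pruning.", "Doc about KV caches."]
    doc_vecs = rng.normal(size=(3, 8))
    embed = lambda text: rng.normal(size=8)                               # placeholder encoder
    generate = lambda prompt: "(model output for: %s...)" % prompt[:40]   # placeholder LLM
    print(rag_answer("What is KV cache eviction?", embed, generate, doc_vecs, docs))
```
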
Article Title | Year | Subfield | Link |
---|---|---|---|
Roformer: Enhanced Transformer with Rotary Position Embedding | 2021 | Positional Embedding | Link |
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned | 2019 | Self-Attention Analysis | Link |
Are Sixteen Heads Really Better than One? | 2019 | Multi-Head Attention | Link |
TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices | 2023 | Tiny Devices, NAS | Link |
Efficient Transformers: A Survey | 2020 | Transformer Survey, Efficiency | Link |
Neural Architecture Search on Efficient Transformers and Beyond | 2022 | NAS, Transformer Optimization | Link |
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search | 2023 | Vision Transformers, GAS | Link |
HAT: Hardware-Aware Transformers for Efficient Natural Language Processing | 2020 | Hardware-Aware Design, NLP | Link |
Linformer: Self-Attention with Linear Complexity | 2020 | Attention Mechanisms | Link |
BigBird: Transformers for Longer Sequences | 2020 | Long-Sequence Transformers | Link |
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | 2021 | Semantic Segmentation | Link |
MixFormerV2: Efficient Fully Transformer Tracking | 2023 | Object Tracking | Link |
Sparse-pruning for accelerating transformer models | 2023 | Pruning and Optimization | Link |
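
Since rotary position embedding (RoFormer) underlies many of the architectures in this table, here is a minimal sketch of the rotation applied to query/key vectors before attention, using the common half-split implementation; the base of 10000 follows the paper, and the shapes are illustrative:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of channels of x by position-dependent angles (RoPE).

    x: (seq_len, dim) with dim even; returns a tensor of the same shape.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)  # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]            # split channels into two halves
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

if __name__ == "__main__":
    q = torch.randn(5, 8)
    print(apply_rope(q).shape)   # torch.Size([5, 8])
```
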
Article Title | Year | Subfield | Link |
---|---|---|---|
Mixtral of Experts | 2024 | Sparse MoE, Routing Networks | Link |
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts | 2024 | Time Series, Large-Scale Models | Link |
AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts | 2023 | Vision, Multi-Task Learning | Link |
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | 2024 | Vision-Language, Multimodal Models | Link |
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | 2024 | Multimodal LLMs, Unified Models | Link |
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models | 2024 | Language Models, Expert Specialization | Link |
OLMoE: Open Mixture-of-Experts Language Models | 2024 | Language Models, Open-Source MoE | Link |
MH-MoE: Multi-Head Mixture-of-Experts | 2024 | Multi-Head Attention, MoE | Link |
Mixture of LoRA Experts | 2024 | Parameter-Efficient Fine-Tuning, MoE | Link |
From Sparse to Soft Mixtures of Experts | 2024 | Soft MoE, Differentiable Routing | Link |
A Survey on Mixture of Experts | 2024 | Survey, MoE Architectures | Link |
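
The sparse MoE layers surveyed above route each token to a small subset of expert MLPs chosen by a learned gate. A minimal dense-loop reference sketch of top-k routing (Mixtral-style top-2 by default); it omits the load-balancing loss and the batched dispatch that real implementations use:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Token-level top-k routing over a set of expert MLPs (sparse MoE skeleton)."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). Each token is sent to its top_k experts.
        logits = self.router(x)                                  # (tokens, experts)
        weights, chosen = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

if __name__ == "__main__":
    moe = TopKMoE(dim=16)
    print(moe(torch.randn(10, 16)).shape)
```
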
Article Title | Year | Subfield | Link |
---|---|---|---|
PoWER-BERT: Accelerating BERT Inference via Progressive Word-Vector Elimination | 2020 | Token Pruning | Link |
Learning Both Weights and Connections for Efficient Neural Networks | 2015 | Pruning | Link |
Mixed Precision Training | 2018 | Training Optimization | Link |
Mixed Precision Training of Convolutional Neural Networks using Integer Operations | 2018 | Integer Operations, CNNs | Link |
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes | 2018 | Scalability, Training Systems | Link |
Guaranteed Approximation Bounds for Mixed-Precision Neural Operators | 2023 | Neural Operators, Approximation Theory | Link |
Training with Mixed-Precision Floating-Point Assignments | 2023 | Floating-Point Precision, Training Algorithms | Link |
Mixed-precision deep learning based on computational memory | 2020 | Computational Memory, Model Efficiency | Link |
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | 2024 | Cache Compression, Mixed Precision Quantization | Link |
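
The training-side entries above largely follow the loss-scaled FP16 recipe from "Mixed Precision Training": run matmuls in half precision, keep master weights in FP32, and scale the loss so small gradients stay representable. A minimal sketch of one training step using PyTorch's AMP utilities; the model and hyperparameters are illustrative:

```python
import torch
from torch import nn

# A tiny model and one training step under automatic mixed precision:
# matmuls run in float16 on GPU, while master weights and loss scaling stay in float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 512, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
    loss = nn.functional.cross_entropy(model(x), y)
scaler.scale(loss).backward()   # scale the loss so small fp16 gradients do not underflow
scaler.step(optimizer)          # unscale, and skip the step if inf/nan gradients appear
scaler.update()
print(loss.item())
```
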
Article Title | Year | Subfield | Link |
---|---|---|---|
Effective methods for improving interpretability in reinforcement learning models | 2021 | Fine-Tuning | Link |
Efficient Fine-Tuning and Compression of Large Language Models with Low-Rank Adaptation and Pruning | 2023 | Fine-Tuning and Compression | Link |
Light-PEFT: Lightening parameter-efficient fine-tuning via early pruning | 2024 | Fine-Tuning | Link |
Prefix-Tuning: Optimizing Continuous Prompts for Generation | 2021 | Prompt Optimization | Link |
P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-Tuning Universally Across Scales and Tasks | 2021 | Prompt Tuning | Link |
Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models | 2021 | Parameter-Efficient Tuning | Link |
Scaling Sparse Fine-Tuning to Large Language Models | 2024 | Sparse Fine-Tuning | Link |
Full Parameter Fine-Tuning for Large Language Models with Limited Resources | 2023 | Resource-Constrained Fine-Tuning | Link |
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | 2024 | Memory-Efficient Fine-Tuning | Link |
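
Low-rank adapters are the parameter-efficient mechanism that several of these papers prune, quantize, or sample over. A minimal sketch of a LoRA-wrapped linear layer: the base weights are frozen and only the low-rank update B·A is trained; the rank and scaling values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # only the adapters are trained
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)               # the update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(64, 64), r=4)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print("trainable parameters:", trainable)   # only the low-rank A and B matrices
```
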
Article Title | Year | Subfield | Link |
---|---|---|---|
Efficient Continual Pre-training for Building Domain Specific Large Language Models | 2024 | Continual Pretraining | Link |
GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | 2024 | Model Optimization | Link |
Towards Effective and Efficient Continual Pre-training of Large Language Models | 2024 | Continual Pretraining | Link |
A Survey on Efficient Training of Transformers | 2023 | Training Optimization | Link |
Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization | 2024 | Quantization Techniques | Link |
Efficient Pre-training Objectives for Transformers | 2021 | Pretraining Objectives | Link |
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | 2022 | Vision Transformers | Link |
METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals | 2022 | Denoising Pretraining | Link |
Bucket Pre-training is All You Need | 2024 | Data Composition Strategies | Link |
FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models on High Performance Computing Systems | 2024 | HPC Systems | Link |
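
One recurring efficiency lever in this table is how raw documents are packed into fixed-length training sequences; the bucket-based entry above is explicitly about refining this step. A minimal sketch of the plain concatenate-and-chunk packing that such strategies start from; the EOS id and sequence length are illustrative:

```python
def pack_documents(docs: list[list[int]], seq_len: int = 2048, eos_id: int = 0) -> list[list[int]]:
    """Concatenate tokenized documents (separated by EOS) and cut into fixed-length sequences.

    Leftover tokens that do not fill a full sequence are dropped, as is common in practice.
    """
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

if __name__ == "__main__":
    fake_docs = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10, 11]]
    for seq in pack_documents(fake_docs, seq_len=4):
        print(seq)
```
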