
# Survey on Efficient Large Foundation Models Design: A Perspective From Model and System Co-Design

*Figure: Efficient Foundation Models Overview*

## Survey Overview

1. Model Design
2. System Design
3. Model-Sys Co-Design

## Model Design

### Quantization

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs | 2024 | Quantization | Link |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | 2022 | Quantization | Link |
| LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | 2022 | Quantization | Link |
| LLM-QAT: Data-Free Quantization Aware Training for Large Language Models | 2023 | Quantization | Link |
| BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation | 2024 | Quantization | Link |
| OneBit: Towards Extremely Low-Bit Large Language Models | 2024 | Quantization | Link |
| QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | Quantization | Link |
| Memory-Efficient Fine-Tuning of Compressed Large Language Models via Sub-4-Bit Integer Quantization | 2023 | Quantization | Link |
| LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models | 2023 | Quantization | Link |
| L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ | 2024 | Quantization | Link |
| QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models | 2023 | Quantization | Link |
| EfficientQAT: Efficient Quantization-Aware Training for Large Language Models | 2024 | Quantization | Link |
| Accurate LoRA-Finetuning Quantization of LLMs via Information Retention | 2024 | Quantization | Link |
| EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge | 2024 | Quantization | Link |
| LUT-GEMM: Quantized Matrix Multiplication Based on LUTs for Efficient Inference in Large-Scale Generative Language Models | 2022 | Quantization | Link |
| SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | 2023 | Quantization | Link |
| OliVe: Accelerating Large Language Models via Hardware-Friendly Outlier-Victim Pair Quantization | 2023 | Quantization | Link |
| QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | 2024 | Quantization | Link |
| ZeroQuant-V2: Exploring Post-Training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | 2023 | Quantization | Link |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | 2023 | Quantization | Link |
| SpinQuant: LLM Quantization with Learned Rotations | 2024 | Quantization | Link |
| Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving | 2024 | Quantization | Link |
| FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | 2023 | Quantization | Link |
| Compression of Generative Pre-trained Language Models via Quantization | 2022 | Quantization | Link |
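
For orientation, the snippet below sketches the baseline all of these methods improve on: round-to-nearest post-training quantization with symmetric, per-output-channel INT8 scales. It is a minimal illustration (the function names are ours, not from any paper above); GPTQ adds error compensation, QuaRot/SpinQuant add rotations, and SpQR/OliVe handle outliers on top of this idea.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```

Per-channel scales bound each row's quantization error independently, which is why most PTQ methods start from this granularity before adding their own refinements.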

### Distillation

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes | 2024 | Knowledge Distillation | Link |
| Knowledge Distillation for Closed-Source Language Models | 2024 | Knowledge Distillation | Link |
| Flamingo: A Visual Language Model for Few-Shot Learning | 2022 | Knowledge Distillation | Link |
| In-Context Learning Distillation: Transferring Few-Shot Learning Ability of Pre-trained Language Models | 2022 | Knowledge Distillation | Link |
| Less Is More: Task-Aware Layer-Wise Distillation for Language Model Compression | 2022 | Knowledge Distillation | Link |
| MiniLLM: Knowledge Distillation of Large Language Models | 2024 | Knowledge Distillation | Link |
| Baby Llama: Knowledge Distillation from an Ensemble of Teachers Trained on a Small Dataset with No Performance Penalty | 2023 | Knowledge Distillation | Link |
| Propagating Knowledge Updates to LMs Through Distillation | 2023 | Knowledge Distillation | Link |
| MetaICL: Learning to Learn in Context | 2022 | Knowledge Distillation | Link |
| Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation | 2024 | Knowledge Distillation | Link |
| Layer-Wise Knowledge Distillation for BERT Fine-Tuning | 2022 | Knowledge Distillation | Link |
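
A common denominator of these works is the temperature-softened logit-matching loss from classic knowledge distillation. The sketch below shows that loss in isolation (a minimal sketch, assuming teacher and student share a vocabulary); the papers above extend it with on-policy student samples, layer-wise terms, or proxies for closed-source teachers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student = torch.randn(4, 32000)   # logits over a 32k-token vocabulary
teacher = torch.randn(4, 32000)
print(distillation_loss(student, teacher).item())
```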

### Pruning

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023 | Unstructured; Saliency | Link |
| A Simple and Effective Pruning Approach for Large Language Models | 2023 | Unstructured; Saliency | Link |
| One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models | 2024 | Unstructured; Saliency | Link |
| Pruning as a Domain-Specific LLM Extractor | 2024 | Unstructured; Saliency | Link |
| Plug-and-Play: An Efficient Post-Training Pruning Method for Large Language Models | 2024 | Unstructured; Saliency | Link |
| Pruner-Zero: Evolving Symbolic Pruning Metric from Scratch for Large Language Models | 2024 | Unstructured; Saliency | Link |
| Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models | 2023 | Unstructured; Saliency | Link |
| Language-Specific Pruning for Efficient Reduction of Large Language Models | 2024 | Unstructured; Saliency | Link |
| BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation | 2024 | Unstructured; Optimization | Link |
| LLM-Pruner: On the Structural Pruning of Large Language Models | 2023 | Structured; Saliency | Link |
| Fluctuation-Based Adaptive Structured Pruning for Large Language Models | 2024 | Structured; Saliency | Link |
| Structured Pruning for Large Language Models Using Coupled Components Elimination and Minor Fine-Tuning | 2024 | Structured; Saliency | Link |
| Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning | 2023 | Structured; Optimization | Link |
| NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models | 2023 | Structured | Link |
| Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning | 2024 | Structured; PEFT | Link |
| LoRAPrune: Structured Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | 2024 | Structured; PEFT | Link |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | 2023 | Infrastructure | Link |
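
Most one-shot pruning entries above differ mainly in the saliency score used to rank weights. The sketch below shows plain magnitude pruning plus a Wanda-style |W|·‖x‖ score (following "A Simple and Effective Pruning Approach for Large Language Models"); the calibration activations here are random stand-ins for real data.

```python
import numpy as np

def magnitude_mask(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Keep the largest-|w| weights; True = keep."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w).ravel())[k]
    return np.abs(w) >= thresh

def wanda_score(w: np.ndarray, act_norm: np.ndarray) -> np.ndarray:
    """Wanda-style saliency: |weight| times the L2 norm of its input channel."""
    return np.abs(w) * act_norm[None, :]

w = np.random.randn(16, 64)
w_pruned = w * magnitude_mask(w, sparsity=0.5)
acts = np.random.randn(128, 64)                  # stand-in calibration activations
scores = wanda_score(w, np.linalg.norm(acts, axis=0))
print("kept fraction:", (w_pruned != 0).mean())
```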

## System Design

### KV Cache

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| PyramidKV: Dynamic KV Cache Compression Based on Pyramidal Information Funneling | 2024 | KV Cache Compression | Link |
| Quest: Query-Aware Sparsity for Long-Context Transformers | 2024 | KV Cache Compression | Link |
| ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression | 2024 | KV Cache Compression | Link |
| VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration | 2024 | KV Cache Compression | Link |
| Token Merging: Your ViT But Faster | 2023 | Token Merging | Link |
| KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization | 2024 | KV Cache Compression; Quantization | Link |
| Layer-Condensed KV Cache for Efficient Inference of Large Language Models | 2024 | KV Cache Optimization; Memory Efficiency | Link |
| InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | 2024 | Dynamic KV Cache Management; Long-Text Generation | Link |
| KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing | 2024 | KV Cache Sharing; Memory Optimization | Link |
| MiniCache: KV Cache Compression in Depth Dimension for Large Language Models | 2024 | KV Cache Compression; Depth Dimension | Link |
| Unifying KV Cache Compression for Large Language Models with LeanKV | 2024 | KV Cache Compression; Unified Framework | Link |
| Efficient LLM Inference with I/O-Aware Partial KV Cache Offloading | 2024 | KV Cache Offloading; I/O Optimization | Link |
| LayerKV: Optimizing Large Language Model Serving with Layer-Wise KV Cache Offloading | 2024 | KV Cache Offloading; Serving Optimization | Link |
| A Method for Building Large Language Models with Predefined KV Cache Capacity | 2024 | KV Cache Management; Predefined Capacity | Link |
| D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | 2024 | Dynamic Operations; KV Cache Optimization | Link |
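
All of the compression, sharing, and offloading papers above operate on the same underlying structure: a per-layer cache of past keys and values that is appended to and reused at every decode step. The sketch below shows that structure in its simplest form (class and method names are ours, for illustration only).

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Grows along the sequence dimension; one instance per layer in practice."""
    def __init__(self):
        self.k, self.v = None, None

    def append(self, k_new, v_new):          # shapes: (batch, 1, head_dim)
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

cache, d = KVCache(), 64
for step in range(5):                        # decode 5 tokens
    q = torch.randn(1, 1, d)
    k, v = cache.append(torch.randn(1, 1, d), torch.randn(1, 1, d))
    attn = F.softmax(q @ k.transpose(1, 2) / d**0.5, dim=-1)
    out = attn @ v                           # (1, 1, d)
print("cached tokens:", cache.k.shape[1])
```

Because the cache grows linearly with context length and layer count, the papers above compress it (quantization, merging), share it across layers, or offload it to host memory.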

### Token Sparsification

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | 2021 | Dynamic Token Sparsification | Link |
| A-ViT: Adaptive Tokens for Efficient Vision Transformer | 2022 | Adaptive Tokens | Link |
| Quest: Query-Aware Sparsity for Long-Context Transformers | 2024 | Query-Aware Sparsity | Link |
| Token-Level Transformer Pruning | 2022 | Token Pruning | Link |
| Token Merging: Your ViT But Faster | 2023 | Token Merging | Link |
| LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | 2024 | Dynamic Token Pruning | Link |
| Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache | 2024 | Token Oracle; Sparse-Quantized KV Cache | Link |
| Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving | 2024 | Video Token Sparsification; Multimodal LLMs | Link |
| Efficient Token Sparsification Through the Lens of Infused Knowledge | 2024 | Token Sparsification; Infused Knowledge | Link |
| Tandem Transformers for Inference Efficient LLMs | 2024 | Tandem Transformers; Inference Efficiency | Link |
| CHESS: Optimizing LLM Inference via Channel-Wise Activation Sparsification | 2024 | Activation Sparsification; LLM Optimization | Link |
| FlightLLM: Efficient Large Language Model Inference with a Complete Compression Strategy | 2024 | Compression Strategy; LLM Inference | Link |
| Sparse Expansion and Neuronal Disentanglement | 2024 | Sparse Expansion; Neuronal Disentanglement | Link |
| Efficient LLM Inference Using Dynamic Input Pruning and Activation Sparsity | 2024 | Input Pruning; Activation Sparsity | Link |
| Sparsity for Efficient LLM Inference | 2024 | Sparsity Techniques; LLM Efficiency | Link |
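
A recurring recipe in these works is to score tokens, keep the top-k, and drop the rest mid-network. The sketch below uses received attention mass as the score, in the spirit of DynamicViT/LazyLLM; real methods use varied, often learned, scoring heuristics, so treat this rule as illustrative.

```python
import torch

def prune_tokens(x: torch.Tensor, attn: torch.Tensor, keep_ratio: float):
    """x: (batch, seq, dim); attn: (batch, seq, seq) attention weights."""
    score = attn.mean(dim=1)                   # attention each token receives
    k = max(1, int(x.shape[1] * keep_ratio))
    idx = score.topk(k, dim=-1).indices.sort(dim=-1).values  # keep original order
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))

x = torch.randn(2, 16, 32)
attn = torch.softmax(torch.randn(2, 16, 16), dim=-1)
print(prune_tokens(x, attn, keep_ratio=0.5).shape)   # (2, 8, 32)
```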

### Efficient Cache Eviction

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| HashEvict: A Pre-Attention KV Cache Eviction Strategy Using Locality-Sensitive Hashing | 2024 | Cache Eviction; Locality-Sensitive Hashing | Link |
| Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference | 2024 | Adaptive Budget Allocation; Cache Eviction | Link |
| NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time | 2024 | General Framework; Cache Management | Link |
| In-Context KV-Cache Eviction for LLMs via Attention-Gate | 2024 | Attention-Gate; Dynamic Cache Eviction | Link |
| Layer-Condensed KV Cache for Efficient Inference of Large Language Models | 2024 | Layer Compression; Memory Efficiency | Link |
| D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models | 2024 | Dynamic Operations; KV Cache Optimization | Link |
| H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | 2023 | Heavy-Hitter Tokens; Cache Efficiency | Link |
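
In the spirit of H₂O's heavy-hitter observation, the sketch below evicts the cached tokens with the lowest accumulated attention while always protecting the most recent ones. The budget, recency window, and scoring are illustrative defaults, not any paper's exact policy.

```python
import torch

def evict(k, v, acc_attn, budget: int, recent: int = 4):
    """k, v: (tokens, dim); acc_attn: (tokens,) accumulated attention mass."""
    n = k.shape[0]
    if n <= budget:
        return k, v, acc_attn
    old = n - recent                                   # recent tokens are protected
    keep_old = acc_attn[:old].topk(budget - recent).indices.sort().values
    keep = torch.cat([keep_old, torch.arange(old, n)])
    return k[keep], v[keep], acc_attn[keep]

k, v = torch.randn(32, 64), torch.randn(32, 64)
acc = torch.rand(32)
k2, v2, acc2 = evict(k, v, acc, budget=16)
print(k2.shape)   # (16, 64)
```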

### Memory Management

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 2022 | Memory-Efficient Attention | Link |
| ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression | 2024 | Memory Management | Link |
| Memory Use of GPT-J 6B | 2021 | Memory Management | Link |
| An Evolved Universal Transformer Memory | 2024 | Neural Attention Memory Models | Link |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | 2023 | PagedAttention | Link |
| Recurrent Memory Transformer | 2023 | Recurrent Memory; Long Context Processing | Link |
| SwiftTransformer: High Performance Transformer Implementation in C++ | 2023 | Model Parallelism; FlashAttention; PagedAttention | Link |
| ChronoFormer: Memory-Efficient Transformer for Time Series Analysis | 2024 | Memory Efficiency; Time Series Data | Link |
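
PagedAttention's key idea is to allocate KV memory in fixed-size blocks from a shared pool, mapping each sequence's logical token positions through a block table. The sketch below shows only that bookkeeping, with no actual tensors; the class and method names are ours, not vLLM's API.

```python
class BlockAllocator:
    """Toy paged-KV allocator: memory grows in block granularity per sequence."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids in the shared pool
        self.tables = {}                      # seq_id -> (block table, token count)

    def append_token(self, seq_id: int):
        table, n = self.tables.get(seq_id, ([], 0))
        if n % self.block_size == 0:          # current block is full: grab a new one
            table = table + [self.free.pop()]
        self.tables[seq_id] = (table, n + 1)

    def free_seq(self, seq_id: int):
        table, _ = self.tables.pop(seq_id)
        self.free.extend(table)               # blocks return to the shared pool

alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(40):                           # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token(seq_id=0)
print(len(alloc.tables[0][0]), "blocks used")
```

Block-granular allocation avoids reserving worst-case contiguous memory per request, which is what makes high-throughput batched serving feasible.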

### Efficient Retrieval-Augmented Models

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| REALM: Retrieval-Augmented Language Model Pre-Training | 2020 | Retrieval-Augmented Models | Link |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | 2020 | Retrieval-Augmented Models | Link |
| BTR: Binary Token Representations for Efficient Retrieval-Augmented Language Models | 2023 | Model Efficiency; Token Compression | Link |
| FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference | 2024 | Long Context Inference; Model Efficiency | Link |
| REPLUG: Retrieval-Augmented Black-Box Language Models | 2023 | Black-Box Models; Retrieval Integration | Link |
| KG-Retriever: Efficient Knowledge Indexing for Retrieval-Augmented Large Language Models | 2024 | Knowledge Indexing; Large Language Models | Link |
| Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models | 2024 | Open-Source Models; Reasoning Enhancement | Link |
| ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Retrieval and Generation | 2024 | Retrieval Improvement; Generation Enhancement | Link |
| RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit | 2023 | Toolkit; Model Integration | Link |
| Making Retrieval-Augmented Language Models Robust to Irrelevant Documents | 2023 | Model Robustness; Document Relevance | Link |
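
Most entries above share the retrieve-then-read pattern: embed the query, fetch the top-k nearest documents, and condition generation on them. The sketch below uses random vectors and a plain string prompt as stand-ins for a real embedding model and generator.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, docs, k: int = 2):
    """Cosine-similarity top-k over a toy in-memory document store."""
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return [docs[i] for i in np.argsort(-sims)[:k]]

docs = ["Paris is the capital of France.", "The Nile is in Africa.",
        "GPUs accelerate matrix multiplication."]
doc_embs = np.random.randn(3, 128)                     # stand-in embeddings
query_emb = doc_embs[0] + 0.1 * np.random.randn(128)   # query "near" doc 0
context = "\n".join(retrieve(query_emb, doc_embs, docs))
prompt = f"Context:\n{context}\n\nQuestion: What is the capital of France?"
print(prompt)
```

The efficiency papers above attack each stage of this loop: compressing document representations (BTR), caching retrieved context (FlashBack), or improving the index (KG-Retriever).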

### Efficient Transformer Architecture Design

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| RoFormer: Enhanced Transformer with Rotary Position Embedding | 2021 | Positional Embedding | Link |
| Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned | 2019 | Self-Attention Analysis | Link |
| Are Sixteen Heads Really Better Than One? | 2019 | Multi-Head Attention | Link |
| TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices | 2023 | Tiny Devices; NAS | Link |
| Efficient Transformers: A Survey | 2020 | Transformer Survey; Efficiency | Link |
| Neural Architecture Search on Efficient Transformers and Beyond | 2022 | NAS; Transformer Optimization | Link |
| TurboViT: Generating Fast Vision Transformers via Generative Architecture Search | 2023 | Vision Transformers; GAS | Link |
| HAT: Hardware-Aware Transformers for Efficient Natural Language Processing | 2020 | Hardware-Aware Design; NLP | Link |
| Linformer: Self-Attention with Linear Complexity | 2020 | Attention Mechanisms | Link |
| Big Bird: Transformers for Longer Sequences | 2020 | Long-Sequence Transformers | Link |
| SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | 2021 | Semantic Segmentation | Link |
| MixFormerV2: Efficient Fully Transformer Tracking | 2023 | Object Tracking | Link |
| Sparse Pruning for Accelerating Transformer Models | 2023 | Pruning and Optimization | Link |
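
As a concrete example from the first entry, rotary position embedding (RoPE, from RoFormer) rotates pairs of feature dimensions by position-dependent angles so that query-key dot products depend only on relative position. Below is a minimal single-sequence sketch using the common rotate-half formulation.

```python
import torch

def rope(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq, dim), dim even."""
    seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))   # per-pair frequency
    angles = torch.arange(seq)[:, None] * freqs[None, :]   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(10, 64)
print(rope(q).shape)   # (10, 64)
```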

## Model-Sys Co-Design

### MoE

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| Mixtral of Experts | 2024 | Sparse MoE; Routing Networks | Link |
| Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts | 2024 | Time Series; Large-Scale Models | Link |
| AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts | 2024 | Vision; Multi-Task Learning | Link |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | 2024 | Vision-Language; Multimodal Models | Link |
| Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts | 2024 | Multimodal LLMs; Unified Models | Link |
| DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models | 2024 | Language Models; Expert Specialization | Link |
| OLMoE: Open Mixture-of-Experts Language Models | 2024 | Language Models; Open-Source MoE | Link |
| MH-MoE: Multi-Head Mixture-of-Experts | 2024 | Multi-Head Attention; MoE | Link |
| Mixture of LoRA Experts | 2024 | Parameter-Efficient Fine-Tuning; MoE | Link |
| From Sparse to Soft Mixtures of Experts | 2024 | Soft MoE; Differentiable Routing | Link |
| A Survey on Mixture of Experts | 2024 | Survey; MoE Architectures | Link |
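
The sparse-MoE entries above route each token to a small subset of experts, so compute grows with k rather than the total expert count. The sketch below implements plain top-k routing with renormalized gate weights (in the spirit of Mixtral-style MoE layers); it loops over experts for clarity, whereas real systems batch tokens by expert and add load-balancing losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=32, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, dim)
        logits = self.gate(x)
        w, idx = logits.topk(self.k, dim=-1)       # (tokens, k)
        w = F.softmax(w, dim=-1)                   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for j in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, j] == e
                if mask.any():
                    out[mask] += w[mask, j, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(8, 32)).shape)   # (8, 32)
```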

### Mixed Precision Training

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| PoWER-BERT: Accelerating BERT Inference via Progressive Word-Vector Elimination | 2020 | Token Elimination | Link |
| Learning Both Weights and Connections for Efficient Neural Networks | 2015 | Pruning | Link |
| Mixed Precision Training | 2018 | Training Optimization | Link |
| Mixed Precision Training of Convolutional Neural Networks Using Integer Operations | 2018 | Integer Operations; CNNs | Link |
| Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes | 2018 | Scalability; Training Systems | Link |
| Guaranteed Approximation Bounds for Mixed-Precision Neural Operators | 2023 | Neural Operators; Approximation Theory | Link |
| Training with Mixed-Precision Floating-Point Assignments | 2023 | Floating-Point Precision; Training Algorithms | Link |
| Mixed-Precision Deep Learning Based on Computational Memory | 2020 | Computational Memory; Model Efficiency | Link |
| No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization | 2024 | Cache Compression; Mixed Precision Quantization | Link |
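
The core recipe from "Mixed Precision Training" is to run the forward and backward passes in a low-precision format while keeping FP32 master weights, with loss scaling to protect small FP16 gradients from underflow. The sketch below uses PyTorch's autocast on CPU with bfloat16 so it runs anywhere; on CUDA one would typically use float16 together with a gradient scaler.

```python
import torch

model = torch.nn.Linear(64, 64)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 64), torch.randn(8, 64)

# Forward pass in low precision; parameters and the update stay in FP32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = ((model(x) - y) ** 2).mean()
loss.backward()                     # grads accumulate into the FP32 parameters
opt.step()
opt.zero_grad()
print("loss:", loss.item())
```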

### Fine-Tuning with System Optimization

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| Effective Methods for Improving Interpretability in Reinforcement Learning Models | 2021 | Fine-Tuning | Link |
| Efficient Fine-Tuning and Compression of Large Language Models with Low-Rank Adaptation and Pruning | 2023 | Fine-Tuning and Compression | Link |
| Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning | 2024 | Fine-Tuning | Link |
| Prefix-Tuning: Optimizing Continuous Prompts for Generation | 2021 | Prompt Optimization | Link |
| P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-Tuning Universally Across Scales and Tasks | 2021 | Prompt Tuning | Link |
| Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models | 2021 | Parameter-Efficient Tuning | Link |
| Scaling Sparse Fine-Tuning to Large Language Models | 2024 | Sparse Fine-Tuning | Link |
| Full Parameter Fine-Tuning for Large Language Models with Limited Resources | 2023 | Resource-Constrained Fine-Tuning | Link |
| LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning | 2024 | Memory-Efficient Fine-Tuning | Link |
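
Several entries in this survey (QLoRA, QA-LoRA, LoRAPrune) build on LoRA: freeze the base weight and learn a low-rank update B·A scaled by alpha/r. The sketch below wraps a single linear layer (the class name is ours); zero-initializing B makes the adapter a no-op at the start of training.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # freeze the base model
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
print(layer(torch.randn(4, 64)).shape)   # (4, 64)
```

Only A and B receive gradients, which is what lets quantized or pruned variants keep the base weights in a compressed format during fine-tuning.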

### Efficient Pretraining

| Article Title | Year | Subfield | Link |
|---|---|---|---|
| Efficient Continual Pre-training for Building Domain-Specific Large Language Models | 2024 | Continual Pretraining | Link |
| GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values | 2024 | Model Optimization | Link |
| Towards Effective and Efficient Continual Pre-training of Large Language Models | 2024 | Continual Pretraining | Link |
| A Survey on Efficient Training of Transformers | 2023 | Training Optimization | Link |
| Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization | 2024 | Quantization Techniques | Link |
| Efficient Pre-training Objectives for Transformers | 2021 | Pretraining Objectives | Link |
| MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers | 2022 | Vision Transformers | Link |
| METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model-Generated Signals | 2022 | Denoising Pretraining | Link |
| Bucket Pre-training is All You Need | 2024 | Data Composition Strategies | Link |
| FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models on High Performance Computing Systems | 2024 | HPC Systems | Link |
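
Many of the objective-focused entries above start from masked-language-model pretraining and make it cheaper or denser (e.g., METRO's model-generated denoising signals). The sketch below shows just the input-corruption step of MLM; the 15% rate and mask token id are illustrative defaults, not any paper's exact setting.

```python
import torch

def mlm_mask(tokens: torch.Tensor, mask_id: int, mask_rate: float = 0.15):
    """Returns (corrupted input, labels); labels are -100 where not masked."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_rate
    inputs = tokens.clone()
    inputs[mask] = mask_id
    labels = torch.full_like(tokens, -100)   # -100 = ignored by cross-entropy
    labels[mask] = tokens[mask]
    return inputs, labels

tokens = torch.randint(0, 1000, (2, 16))
inp, lab = mlm_mask(tokens, mask_id=103)
print(inp.shape, (lab != -100).float().mean().item())
```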
