- ⭐Post-Training Quantization for Vision Transformer - PKU & Huawei Noah’s Ark Lab, NeurIPS 2021
- ⭐PTQ4ViT: Post-Training Quantization Framework for Vision Transformers - Houmo AI & PKU, ECCV 2022
- ⭐FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer - Megvii Technology, IJCAI 2022
- Q-ViT: Fully Differentiable Quantization for Vision Transformer - Megvii Technology & CASIA, arxiv 2022
- TerViT: An Efficient Ternary Vision Transformer - Beihang University & Shanghai Artificial Intelligence Laboratory, arxiv 2022
- Patch Similarity Aware Data-Free Quantization for Vision Transformers - CASIA, ECCV 2022
- PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers - CASIA, arxiv 2022
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers - NJU & UCB & PKU, arxiv 2022
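
A recurring difficulty in the ViT post-training quantization papers above is the highly skewed distribution of post-softmax attention probabilities. As a rough illustration of one idea in that space, the sketch below applies a log2-style quantizer to attention maps, in the spirit of (but not copied from) FQ-ViT; the function name and bit-width are illustrative choices, not anything from the papers.

```python
import torch

def log2_quantize_attention(probs: torch.Tensor, n_bits: int = 4):
    """Quantize post-softmax attention probabilities on a log2 grid.

    Illustrative sketch: attention maps are concentrated near zero, so a
    log-domain grid spends its levels where the mass is.
    """
    qmax = 2 ** n_bits - 1
    # Map each probability p in (0, 1] to an integer code q ~ -log2(p).
    q = torch.clamp(torch.round(-torch.log2(probs.clamp(min=1e-12))), 0, qmax)
    dequant = torch.pow(2.0, -q)  # reconstruct p ~ 2^(-q)
    return q.to(torch.uint8), dequant

# Toy usage: quantize one attention row and inspect the worst-case error.
scores = torch.randn(1, 8)
probs = torch.softmax(scores, dim=-1)
codes, probs_hat = log2_quantize_attention(probs, n_bits=4)
print(codes, (probs - probs_hat).abs().max())
```

Note that the dequantized probabilities no longer sum exactly to one; the papers above handle such details (e.g., integer-only softmax variants) in their own, format-specific ways.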
- Q8BERT: Quantized 8Bit BERT - Intel AI Lab, NeurIPS Workshop 2019
- TernaryBERT: Distillation-aware Ultra-low Bit BERT - Huawei Noah’s Ark Lab, EMNLP 2020
- ⭐I-BERT: Integer-only BERT Quantization - University of California, Berkeley, ICML 2021
- ⭐Understanding and Overcoming the Challenges of Efficient Transformer Quantization - Qualcomm AI Research, EMNLP 2021
- Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training - NVIDIA, ICML 2022
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers - Microsoft, arxiv 2022
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models - BUAA & SenseTime & PKU & UESTC, NeurIPS 2022
- Compression of Generative Pre-trained Language Models via Quantization - The University of Hong Kong & Huawei Noah’s Ark Lab, ACL 2022
- nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models - Pohang University of Science and Technology, arxiv 2022
- ⭐LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - University of Washington & FAIR, NeurIPS 2022
- ⭐SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - MIT, arxiv 2022
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - IST Austria & ETH Zurich, arxiv 2022
- The case for 4-bit precision: k-bit Inference Scaling Laws - University of Washington, arxiv 2022
- Quadapter: Adapter for GPT-2 Quantization - Qualcomm AI Research, arxiv 2022
- A Comprehensive Study on Post-Training Quantization for Large Language Models - Microsoft, arxiv 2023
- RPTQ: Reorder-based Post-training Quantization for Large Language Models - Houmo AI, arxiv 2023
- Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling - BUAA & SenseTime & PKU & UESTC, arxiv 2023
- ⭐QLoRA: Efficient Finetuning of Quantized LLMs - University of Washington, arxiv 2023
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - MIT, arxiv 2023
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time - Rice University, arxiv 2023
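
Most of the LLM weight-quantization entries above build on one primitive: mapping a tensor to INT8 (or lower) with a scale derived from its value range. The sketch below shows plain symmetric per-output-channel absmax quantization as a baseline; it is not the method of any single paper (LLM.int8(), SmoothQuant, GPTQ, and AWQ all modify this step, e.g., to cope with activation outliers or to minimize layer-wise reconstruction error), and the function names are illustrative.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-output-channel absmax INT8 quantization of a weight
    matrix with shape (out_features, in_features). Generic baseline only."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_int8, scale

def dequantize(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight for reference computation."""
    return w_int8.float() * scale

# Toy usage: quantize a random layer and measure the reconstruction error.
w = torch.randn(256, 512)
w_int8, scale = quantize_weight_int8(w)
err = (w - dequantize(w_int8, scale)).abs().mean()
print(f"mean abs reconstruction error: {err:.5f}")
```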
- A Fast Post-Training Pruning Framework for Transformers - UC Berkeley, arxiv 2022
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot - IST Austria, arxiv 2023
- What Matters in the Structured Pruning of Generative Language Models? - CMU & Microsoft, arxiv 2023
- ZipLM: Hardware-Aware Structured Pruning of Language Models - IST Austria, arxiv 2023
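
As a point of reference for the pruning entries above, the sketch below is the simplest one-shot baseline: global magnitude pruning to a target sparsity. SparseGPT and ZipLM go well beyond this (Hessian-based weight reconstruction, structured removal), so treat it only as the common baseline such methods are compared against; the helper name is illustrative.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so that roughly `sparsity`
    of the weights are exactly zero. Generic one-shot baseline, not the
    method of any paper listed above."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w.clone()
    # k-th smallest absolute value serves as the pruning threshold.
    threshold = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > threshold
    return w * mask

# Toy usage: prune half the weights of a random matrix.
w = torch.randn(128, 128)
w_sparse = magnitude_prune(w, sparsity=0.5)
print((w_sparse == 0).float().mean())  # ~0.5
```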
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale - Microsoft, arxiv 2022
- PETALS: Collaborative Inference and Fine-tuning of Large Models - Yandex, arxiv 2022
- Efficiently Scaling Transformer Inference - Google, arxiv 2022
- ⭐High-throughput Generative Inference of Large Language Models with a Single GPU - Stanford et al., arxiv 2023
- Accelerating Large Language Model Decoding with Speculative Sampling - DeepMind, arxiv 2023
- ⭐SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification - CMU & UCSD, arxiv 2023
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - UCB, blog 2023
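
Several of the serving papers above rely on speculative decoding: a cheap draft model proposes tokens and the target model verifies them. The sketch below shows only the per-token accept/resample rule, with stand-in categorical distributions instead of real language models; full systems (speculative sampling, SpecInfer's token trees) wrap this in a decoding loop over a multi-token draft and batch the target-model scoring.

```python
import torch

def speculative_step(p: torch.Tensor, q: torch.Tensor, draft_token: int) -> int:
    """One accept/reject step of speculative decoding.

    p: target-model distribution over the vocabulary at this position.
    q: draft-model distribution the draft token was sampled from.
    Accept the draft token with probability min(1, p/q); otherwise resample
    from the normalized residual max(0, p - q). Toy sketch only.
    """
    accept_prob = torch.clamp(p[draft_token] / q[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1))

# Toy usage with a 5-token vocabulary.
p = torch.tensor([0.1, 0.4, 0.2, 0.2, 0.1])  # target distribution
q = torch.tensor([0.3, 0.3, 0.2, 0.1, 0.1])  # draft distribution
draft_token = int(torch.multinomial(q, 1))
print(speculative_step(p, q, draft_token))
```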
- A3: Accelerating Attention Mechanisms in Neural Networks with Approximation - Seoul National University & Hynix, HPCA 2020
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks - Seoul National University, ISCA 2021
- Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization - ECNU, ICCAD 2022
- Accelerating Attention through Gradient-Based Learned Runtime Pruning - UCSD & Google, ISCA 2022
- Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks - IBM, NeurIPS 2019
- FP8 Quantization: The Power of the Exponent - Qualcomm AI Research, arxiv 2022
- FP8 Formats for Deep Learning - NVIDIA & ARM & Intel, arxiv 2022
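
The FP8 entries describe two 8-bit floating-point formats, E4M3 and E5M2. Assuming the E4M3 parameters from "FP8 Formats for Deep Learning" (4 exponent bits, 3 mantissa bits, bias 7, max normal 448, saturating cast), the sketch below rounds a Python float to its nearest E4M3 value; it is a simulation for intuition, not a bit-exact reference implementation.

```python
import math

def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest FP8 E4M3 value, saturating at +-448.

    Simulation sketch under the stated E4M3 assumptions; special values
    (NaN handling, the missing infinities) are not modeled faithfully.
    """
    if x == 0.0 or math.isnan(x):
        return x
    sign = math.copysign(1.0, x)
    a = min(abs(x), 448.0)                   # saturate to the max normal
    e = max(math.floor(math.log2(a)), -6)    # -6 is the minimum normal exponent
    step = 2.0 ** (e - 3)                    # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(a / step) * step, 448.0)

# Toy usage: small, mid-range, large, and saturating inputs.
for v in [0.07, 1.3, 300.0, 1000.0]:
    print(v, "->", round_to_e4m3(v))
```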
updating ...