- ⭐Post-Training Quantization for Vision Transformer - PKU & Huawei Noah’s Ark Lab, NeurIPS 2021
- ⭐PTQ4ViT: Post-Training Quantization Framework for Vision Transformers - Houmo AI & PKU, ECCV 2022
- ⭐FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer - Megvii Technology, IJCAI 2022
- Q-ViT: Fully Differentiable Quantization for Vision Transformer - Megvii Technology & CASIA, arxiv 2022
- TerViT: An Efficient Ternary Vision Transformer - Beihang University & Shanghai Artificial Intelligence Laboratory, arxiv 2022
- Patch Similarity Aware Data-Free Quantization for Vision Transformers - CASIA, ECCV 2022
- PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers - CASIA, arxiv 2022
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers - NJU & UCB & PKU, arxiv 2022
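
A recurring difficulty in the ViT post-training quantization papers above is the highly skewed distribution of post-softmax attention probabilities. As a rough illustration of one idea in that space, the sketch below applies a log2-style quantizer to attention maps, in the spirit of (but not copied from) FQ-ViT; the function name and bit-width are illustrative choices, not anything from the papers.

```python
import torch

def log2_quantize_attention(probs: torch.Tensor, n_bits: int = 4):
    """Quantize post-softmax attention probabilities on a log2 grid.

    Illustrative sketch: attention maps are concentrated near zero, so a
    log-domain grid spends its levels where the mass is.
    """
    qmax = 2 ** n_bits - 1
    # Map each probability p in (0, 1] to an integer code q ~ -log2(p).
    q = torch.clamp(torch.round(-torch.log2(probs.clamp(min=1e-12))), 0, qmax)
    dequant = torch.pow(2.0, -q)  # reconstruct p ~ 2^(-q)
    return q.to(torch.uint8), dequant

# Toy usage: quantize one attention row and inspect the worst-case error.
scores = torch.randn(1, 8)
probs = torch.softmax(scores, dim=-1)
codes, probs_hat = log2_quantize_attention(probs, n_bits=4)
print(codes, (probs - probs_hat).abs().max())
```

Note that the dequantized probabilities no longer sum exactly to one; the papers above handle such details (e.g., integer-only softmax variants) in their own, format-specific ways.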
- Q8BERT: Quantized 8Bit BERT - Intel AI Lab, NeurIPS Workshop 2019
- TernaryBERT: Distillation-aware Ultra-low Bit BERT - Huawei Noah’s Ark Lab, EMNLP 2020
- ⭐I-BERT: Integer-only BERT Quantization - University of California, Berkeley, ICML 2021
- ⭐Understanding and Overcoming the Challenges of Efficient Transformer Quantization - Qualcomm AI Research, EMNLP 2021
- Optimal Clipping and Magnitude-aware Differentiation for Improved Quantization-aware Training - NVIDIA, ICML 2022
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers - Microsoft, arxiv 2022
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models - BUAA & SenseTime & PKU & UESTC, NeurIPS 2022
- Compression of Generative Pre-trained Language Models via Quantization - The University of Hong Kong & Huawei Noah’s Ark Lab, ACL 2022
- nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models - Pohang University of Science and Technology, arxiv 2022
- ⭐LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - University of Washington & FAIR, NeurIPS 2022
- ⭐SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - MIT, arxiv 2022
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - IST Austria & ETH Zurich, arxiv 2022
- The case for 4-bit precision: k-bit Inference Scaling Laws - University of Washington, arxiv 2022
- Quadapter: Adapter for GPT-2 Quantization - Qualcomm AI Research, arxiv 2022
- A Comprehensive Study on Post-Training Quantization for Large Language Models - Microsoft, arxiv 2023
- RPTQ: Reorder-based Post-training Quantization for Large Language Models - Houmo AI, arxiv 2023
- Outlier Suppression+: Accurate Quantization of Large Language Models by Equivalent and Optimal Shifting and Scaling - BUAA & SenseTime & PKU & UESTC, arxiv 2023
- ⭐QLoRA: Efficient Finetuning of Quantized LLMs - University of Washington, arxiv 2023
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - MIT, arxiv 2023
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time - Rice University, arxiv 2023
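
Most of the LLM weight-quantization entries above build on one primitive: mapping a tensor to INT8 (or lower) with a scale derived from its value range. The sketch below shows plain symmetric per-output-channel absmax quantization as a baseline; it is not the method of any single paper (LLM.int8(), SmoothQuant, GPTQ, and AWQ all modify this step, e.g., to cope with activation outliers or to minimize layer-wise reconstruction error), and the function names are illustrative.

```python
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-output-channel absmax INT8 quantization of a weight
    matrix with shape (out_features, in_features). Generic baseline only."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_int8, scale

def dequantize(w_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float weight for reference computation."""
    return w_int8.float() * scale

# Toy usage: quantize a random layer and measure the reconstruction error.
w = torch.randn(256, 512)
w_int8, scale = quantize_weight_int8(w)
err = (w - dequantize(w_int8, scale)).abs().mean()
print(f"mean abs reconstruction error: {err:.5f}")
```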
- A Fast Post-Training Pruning Framework for Transformers - UC Berkeley, arxiv 2022
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot - IST Austria, arxiv 2023
- What Matters in the Structured Pruning of Generative Language Models? - CMU & Microsoft, arxiv 2023
- ZipLM: Hardware-Aware Structured Pruning of Language Models - IST Austria, arxiv 2023
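
As a point of reference for the pruning entries above, the sketch below is the simplest one-shot baseline: global magnitude pruning to a target sparsity. SparseGPT and ZipLM go well beyond this (Hessian-based weight reconstruction, structured removal), so treat it only as the common baseline such methods are compared against; the helper name is illustrative.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so that roughly `sparsity`
    of the weights are exactly zero. Generic one-shot baseline, not the
    method of any paper listed above."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w.clone()
    # k-th smallest absolute value serves as the pruning threshold.
    threshold = w.abs().flatten().kthvalue(k).values
    mask = w.abs() > threshold
    return w * mask

# Toy usage: prune half the weights of a random matrix.
w = torch.randn(128, 128)
w_sparse = magnitude_prune(w, sparsity=0.5)
print((w_sparse == 0).float().mean())  # ~0.5
```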
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale - Microsoft, arxiv 2022
- PETALS: Collaborative Inference and Fine-tuning of Large Models - Yandex, arxiv 2022
- Efficiently Scaling Transformer Inference - Google, arxiv 2022
- ⭐High-throughput Generative Inference of Large Language Models with a Single GPU - Stanford et al., arxiv 2023
- Accelerating Large Language Model Decoding with Speculative Sampling - DeepMind, arxiv 2023
- ⭐SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification - CMU & UCSD, arxiv 2023
- vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention - UCB, blog 2023
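
Several of the serving papers above rely on speculative decoding: a cheap draft model proposes tokens and the target model verifies them. The sketch below shows only the per-token accept/resample rule, with stand-in categorical distributions instead of real language models; full systems (speculative sampling, SpecInfer's token trees) wrap this in a decoding loop over a multi-token draft and batch the target-model scoring.

```python
import torch

def speculative_step(p: torch.Tensor, q: torch.Tensor, draft_token: int) -> int:
    """One accept/reject step of speculative decoding.

    p: target-model distribution over the vocabulary at this position.
    q: draft-model distribution the draft token was sampled from.
    Accept the draft token with probability min(1, p/q); otherwise resample
    from the normalized residual max(0, p - q). Toy sketch only.
    """
    accept_prob = torch.clamp(p[draft_token] / q[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, 1))

# Toy usage with a 5-token vocabulary.
p = torch.tensor([0.1, 0.4, 0.2, 0.2, 0.1])  # target distribution
q = torch.tensor([0.3, 0.3, 0.2, 0.1, 0.1])  # draft distribution
draft_token = int(torch.multinomial(q, 1))
print(speculative_step(p, q, draft_token))
```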
- A3: Accelerating Attention Mechanisms in Neural Networks with Approximation - Seoul National University & Hynix, HPCA 2020
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks - Seoul National University, ISCA 2021
- Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization - ECNU, ICCAD 2022
- Accelerating Attention through Gradient-Based Learned Runtime Pruning - UCSD & Google, ISCA 2022
- Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks - IBM, NeurIPS 2019
- FP8 Quantization: The Power of the Exponent - Qualcomm AI Research, arxiv 2022
- FP8 Formats for Deep Learning - NVIDIA & ARM & Intel, arxiv 2022
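
The FP8 entries describe two 8-bit floating-point formats, E4M3 and E5M2. Assuming the E4M3 parameters from "FP8 Formats for Deep Learning" (4 exponent bits, 3 mantissa bits, bias 7, max normal 448, saturating cast), the sketch below rounds a Python float to its nearest E4M3 value; it is a simulation for intuition, not a bit-exact reference implementation.

```python
import math

def round_to_e4m3(x: float) -> float:
    """Round a float to the nearest FP8 E4M3 value, saturating at +-448.

    Simulation sketch under the stated E4M3 assumptions; special values
    (NaN handling, the missing infinities) are not modeled faithfully.
    """
    if x == 0.0 or math.isnan(x):
        return x
    sign = math.copysign(1.0, x)
    a = min(abs(x), 448.0)                   # saturate to the max normal
    e = max(math.floor(math.log2(a)), -6)    # -6 is the minimum normal exponent
    step = 2.0 ** (e - 3)                    # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(a / step) * step, 448.0)

# Toy usage: small, mid-range, large, and saturating inputs.
for v in [0.07, 1.3, 300.0, 1000.0]:
    print(v, "->", round_to_e4m3(v))
```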
updating ...