The Quantized Length Adaptive Transformer builds on the Length Adaptive Transformer work. Currently, it supports BERT- and RoBERTa-based transformers.
QuaLA-MiniLM: A Quantized Length Adaptive MiniLM has been accepted at NeurIPS 2022. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches at any computational budget on the SQuAD1.1 dataset (up to 8.8x speedup with <1% accuracy loss). The following shows how to reproduce this work; we also provide Jupyter notebook tutorials.
```bash
pip install intel-extension-for-transformers
pip install -r requirements.txt
```
Note: We suggest using PyTorch 1.12.0 and Intel Extension for PyTorch 1.12.0.
Note: We suggest using a transformers version no higher than 4.34.1.
In this step, `output/finetuning` is a MiniLM model fine-tuned on SQuAD, which has been uploaded to sguskin/minilmv2-L6-H384-squad1.1:
```bash
python run_qa.py \
--model_name_or_path nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large \
--dataset_name squad \
--do_train \
--do_eval \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 8 \
--output_dir output/finetuning
```
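As a quick sanity check (not part of the recipe), the resulting checkpoint can be loaded with the standard transformers pipeline; the question and context below are made up for illustration:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hub (or point at output/finetuning).
qa = pipeline(
    "question-answering",
    model="sguskin/minilmv2-L6-H384-squad1.1",  # or "output/finetuning"
)

result = qa(
    question="What dataset was the model fine-tuned on?",
    context="The MiniLMv2 model was fine-tuned on the SQuAD1.1 dataset "
            "for extractive question answering.",
)
print(result["answer"], round(result["score"], 3))
```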
Train it with length-adaptive training to get the dynamic model `output/dynamic`, which has been uploaded to sguskin/dynamic-minilmv2-L6-H384-squad1.1:
```bash
python run_qa.py \
--model_name_or_path output/finetuning \
--dataset_name squad \
--do_train \
--do_eval \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 8 \
--length_adaptive \
--num_sandwich 2 \
--length_drop_ratio_bound 0.2 \
--layer_dropout_prob 0.2 \
--output_dir output/dynamic
```
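Every length-adaptive training step trains the full-length model, `--num_sandwich` randomly sampled sub-models, and the shortest sub-model (the sandwich rule), with per-layer lengths drawn by LengthDrop. Here is a rough sketch of the sampling, under the assumption that it follows the Length-Adaptive Transformer rule where each layer keeps a random fraction of the previous layer's tokens, bounded by `--length_drop_ratio_bound`:

```python
import random

def sample_length_config(max_seq_length=384, num_layers=6, bound=0.2):
    """Sample a non-increasing per-layer token budget: each layer keeps
    between (1 - bound) and 100% of the previous layer's tokens."""
    lengths, cur = [], max_seq_length
    for _ in range(num_layers):
        cur = random.randint(int((1 - bound) * cur), cur)
        lengths.append(cur)
    return tuple(lengths)

def shortest_config(max_seq_length=384, num_layers=6, bound=0.2):
    """The smallest sub-model: every layer drops the maximum ratio."""
    lengths, cur = [], max_seq_length
    for _ in range(num_layers):
        cur = int((1 - bound) * cur)
        lengths.append(cur)
    return tuple(lengths)

# Sandwich rule with --num_sandwich 2: every step trains the full model,
# two randomly sampled sub-models, and the shortest sub-model.
print((384,) * 6, sample_length_config(), sample_length_config(), shortest_config())
```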
Run evolutionary search to optimize length configurations for any possible target computational budget.
```bash
python run_qa.py \
--model_name_or_path output/dynamic \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--do_eval \
--per_device_eval_batch_size 32 \
--do_search \
--output_dir output/search
```
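The search treats a length configuration as a genome: candidates are mutated and crossed over, each is scored by dev-set F1 and measured FLOPs, and the non-dominated ones are kept. Below is a toy sketch of the mutation step only, with a simple token-count proxy standing in for real FLOPs measurement; the helper names are ours, not run_qa.py's:

```python
import random

def mutate(config, max_seq_length=384, prob=0.5):
    """Resample some layers' budgets while keeping them non-increasing."""
    new, upper = [], max_seq_length
    for length in config:
        if random.random() < prob:
            length = random.randint(1, upper)
        new.append(min(length, upper))
        upper = new[-1]
    return tuple(new)

def cost_proxy(config):
    """Stand-in for measured GFLOPS: compute grows with tokens kept per layer."""
    return sum(config)

population = [(384,) * 6]
for _ in range(200):
    population.append(mutate(random.choice(population)))

# The real search also scores every candidate by SQuAD F1 and keeps the
# accuracy-efficiency Pareto front; here we just show the cheapest samples.
print(sorted(population, key=cost_proxy)[:3])
```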
Quantize the dynamic model with accuracy-driven post-training static quantization:

```bash
python run_qa.py \
--model_name_or_path "sguskin/dynamic-minilmv2-L6-H384-squad1.1" \
--dataset_name squad \
--quantization_approach PostTrainingStatic \
--do_eval \
--do_train \
--tune \
--output_dir output/quantized-dynamic-minilmv \
--overwrite_cache \
--per_device_eval_batch_size 32 \
--overwrite_output_dir
```
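run_qa.py performs post-training static quantization through the Intel toolchain, calibrating on real SQuAD features. As a loose illustration of what int8 buys in model size, here is PyTorch's built-in dynamic quantization, a simpler variant that needs no calibration data, applied to the same checkpoint:

```python
import os
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "sguskin/dynamic-minilmv2-L6-H384-squad1.1"
)

# Convert all Linear layers to int8 weights; activations are quantized
# on the fly at inference time (dynamic quantization).
int8_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m, path="tmp_size_check.pt"):
    """Serialize the state dict and report its size on disk in MB."""
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"fp32: {size_mb(model):.1f} MB -> int8: {size_mb(int8_model):.1f} MB")
```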
Evaluate the int8 model with a length configuration found by the search (both `--model_name_or_path` and `--output_dir` are used to load the int8 model):

```bash
python run_qa.py \
--model_name_or_path "sguskin/dynamic-minilmv2-L6-H384-squad1.1" \
--dataset_name squad \
--do_eval \
--accuracy_only \
--int8 \
--output_dir output/quantized-dynamic-minilmv \
--overwrite_cache \
--per_device_eval_batch_size 32 \
--length_config "(315, 251, 242, 159, 142, 33)"
```
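The `--length_config` tuple is the per-layer token budget found by the search: with `--max_seq_length 384` and six layers, layer 0 keeps 315 tokens, layer 1 keeps 251 of those, and so on down to 33. A small sketch of that bookkeeping:

```python
import ast

length_config = ast.literal_eval("(315, 251, 242, 159, 142, 33)")
max_seq_length = 384

kept = max_seq_length
for layer, budget in enumerate(length_config):
    print(f"layer {layer}: keep {budget} of {kept} tokens (drop {kept - budget})")
    kept = budget
```

Since self-attention and feed-forward cost grow with the number of tokens a layer processes, this schedule is what brings the GFLOPS in the table below from 4.76 down to 2.55.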
Performance results were measured on 07/10/2022 on an Intel Xeon Platinum 8280 Scalable processor with batch size = 32. Performance varies by use, configuration, and other factors; see the platform configuration below for details. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
| Model Name | Datatype | Optimization Method | Model Size (MB) | Accuracy (F1) | Latency (ms) | GFLOPS** | Speedup (compared with BERT Base) |
|---|---|---|---|---|---|---|---|
| BERT Base | fp32 | None | 415.47 | 88.58 | 56.56 | 35.3 | 1x |
| TinyBERT | fp32 | Distillation | 253.20 | 88.39 | 32.40 | 17.7 | 1.75x |
| QuaTinyBERT | int8 | Distillation + quantization | 132.06 | 87.67 | 15.58 | 17.7 | 3.63x |
| MiniLMv2 | fp32 | Distillation | 115.04 | 88.70 | 18.23 | 4.76 | 3.10x |
| QuaMiniLMv2 | int8 | Distillation + quantization | 84.85 | 88.54 | 9.14 | 4.76 | 6.18x |
| LA-MiniLM | fp32 | Drop and restore on base MiniLMv2 | 115.04 | 89.28 | 16.99 | 4.76 | 3.33x |
| LA-MiniLM (269, 253, 252, 202, 104, 34)* | fp32 | Evolutionary search (best config) | 115.04 | 87.76 | 11.44 | 2.49 | 4.94x |
| QuaLA-MiniLM | int8 | Quantization on base LA-MiniLM | 84.85 | 88.85 | 7.84 | 4.76 | 7.21x |
| QuaLA-MiniLM (315, 251, 242, 159, 142, 33)* | int8 | Evolutionary search (best config) | 84.86 | 87.68 | 6.41 | 2.55 | 8.82x |
NOTES: * The tuple gives the per-layer length configuration found by the evolutionary search. ** GFLOPS counts the multiply and add operations performed during model inference (obtained with the torchprofile tool).
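A sketch of how such numbers can be obtained with torchprofile, whose `profile_macs` reports multiply-accumulate operations (MACs) for one traced forward pass; whether MACs are reported as-is or doubled into FLOPs is a convention choice, and the input shapes below are assumptions:

```python
import torch
from torchprofile import profile_macs
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "sguskin/minilmv2-L6-H384-squad1.1"
).eval()

# One full-length example: batch 1, 384 tokens (matching --max_seq_length).
input_ids = torch.randint(0, 30000, (1, 384))
attention_mask = torch.ones(1, 384, dtype=torch.long)

with torch.no_grad():
    macs = profile_macs(model, (input_ids, attention_mask))
print(f"{macs / 1e9:.2f} GMACs per example")
```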
Platform configuration:

| Manufacturer | Intel Corporation |
|---|---|
| Product Name | S2600WFD |
| BIOS Version | 1SE5C620.86B.02.01.0008.031920191559 |
| OS | CentOS Linux release 8.4.2105 |
| Kernel | 4.18.0-305.3.1.el8.x86_64 |
| Microcode | 0x5003006 |
| IRQ Balance | Enabled |
| CPU Model | Intel(R) Xeon Platinum 8280 CPU @ 2.70GHz |
| Base Frequency | 2.7GHz |
| Maximum Frequency | 4.0GHz |
| All-core Maximum Frequency | 3.3GHz |
| CPU(s) | 112 |
| Thread(s) per Core | 2 |
| Core(s) per Socket | 28 |
| Socket(s) | 2 |
| NUMA Node(s) | 2 |
| Turbo | Enabled |
| Frequency Governor | Performance |