The Quantized Length Adaptive Transformer builds on the Length Adaptive Transformer work. Currently, it supports BERT- and RoBERTa-based transformers.
QuaLA-MiniLM: A Quantized Length Adaptive MiniLM has been accepted at NeurIPS 2022. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches at any computational budget on the SQuAD1.1 dataset (up to 8.8x speedup with <1% accuracy loss). The following shows how to reproduce this work; we also provide Jupyter notebook tutorials.
```bash
pip install intel-extension-for-transformers
pip install -r requirements.txt
```
Note: We suggest using PyTorch 1.12.0 and Intel Extension for PyTorch 1.12.0.
Note: We suggest using a transformers version no higher than 4.34.1.
In this step, `output/finetuning` is a MiniLM model fine-tuned on SQuAD, which has been uploaded to sguskin/minilmv2-L6-H384-squad1.1:
```bash
python run_qa.py \
--model_name_or_path nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large \
--dataset_name squad \
--do_train \
--do_eval \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 8 \
--output_dir output/finetuning
```
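As a quick sanity check (not part of the recipe), the resulting checkpoint can be loaded with the standard transformers pipeline; the question and context below are made up for illustration:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hub (or point at output/finetuning).
qa = pipeline(
    "question-answering",
    model="sguskin/minilmv2-L6-H384-squad1.1",  # or "output/finetuning"
)

result = qa(
    question="What dataset was the model fine-tuned on?",
    context="The MiniLMv2 model was fine-tuned on the SQuAD1.1 dataset "
            "for extractive question answering.",
)
print(result["answer"], round(result["score"], 3))
```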
Train it with length-adaptive training to get the dynamic model `output/dynamic`, which has been uploaded to sguskin/dynamic-minilmv2-L6-H384-squad1.1:
```bash
python run_qa.py \
--model_name_or_path output/finetuning \
--dataset_name squad \
--do_train \
--do_eval \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 8 \
--length_adaptive \
--num_sandwich 2 \
--length_drop_ratio_bound 0.2 \
--layer_dropout_prob 0.2 \
--output_dir output/dynamic
```
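Every length-adaptive training step trains the full-length model, `--num_sandwich` randomly sampled sub-models, and the shortest sub-model (the sandwich rule), with per-layer lengths drawn by LengthDrop. Here is a rough sketch of the sampling, under the assumption that it follows the Length-Adaptive Transformer rule where each layer keeps a random fraction of the previous layer's tokens, bounded by `--length_drop_ratio_bound`:

```python
import random

def sample_length_config(max_seq_length=384, num_layers=6, bound=0.2):
    """Sample a non-increasing per-layer token budget: each layer keeps
    between (1 - bound) and 100% of the previous layer's tokens."""
    lengths, cur = [], max_seq_length
    for _ in range(num_layers):
        cur = random.randint(int((1 - bound) * cur), cur)
        lengths.append(cur)
    return tuple(lengths)

def shortest_config(max_seq_length=384, num_layers=6, bound=0.2):
    """The smallest sub-model: every layer drops the maximum ratio."""
    lengths, cur = [], max_seq_length
    for _ in range(num_layers):
        cur = int((1 - bound) * cur)
        lengths.append(cur)
    return tuple(lengths)

# Sandwich rule with --num_sandwich 2: every step trains the full model,
# two randomly sampled sub-models, and the shortest sub-model.
print((384,) * 6, sample_length_config(), sample_length_config(), shortest_config())
```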
Run evolutionary search to optimize length configurations for any possible target computational budget.
```bash
python run_qa.py \
--model_name_or_path output/dynamic \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--do_eval \
--per_device_eval_batch_size 32 \
--do_search \
--output_dir output/search
```
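The search treats a length configuration as a genome: candidates are mutated and crossed over, each is scored by dev-set F1 and measured FLOPs, and the non-dominated ones are kept. Below is a toy sketch of the mutation step only, with a simple token-count proxy standing in for real FLOPs measurement; the helper names are ours, not run_qa.py's:

```python
import random

def mutate(config, max_seq_length=384, prob=0.5):
    """Resample some layers' budgets while keeping them non-increasing."""
    new, upper = [], max_seq_length
    for length in config:
        if random.random() < prob:
            length = random.randint(1, upper)
        new.append(min(length, upper))
        upper = new[-1]
    return tuple(new)

def cost_proxy(config):
    """Stand-in for measured GFLOPS: compute grows with tokens kept per layer."""
    return sum(config)

population = [(384,) * 6]
for _ in range(200):
    population.append(mutate(random.choice(population)))

# The real search also scores every candidate by SQuAD F1 and keeps the
# accuracy-efficiency Pareto front; here we just show the cheapest samples.
print(sorted(population, key=cost_proxy)[:3])
```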
Quantize the dynamic model with accuracy-driven post-training static quantization:

```bash
python run_qa.py \
--model_name_or_path "sguskin/dynamic-minilmv2-L6-H384-squad1.1" \
--dataset_name squad \
--quantization_approach PostTrainingStatic \
--do_eval \
--do_train \
--tune \
--output_dir output/quantized-dynamic-minilmv \
--overwrite_cache \
--per_device_eval_batch_size 32 \
--overwrite_output_dir
```
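run_qa.py performs post-training static quantization through the Intel toolchain, calibrating on real SQuAD features. As a loose illustration of what int8 buys in model size, here is PyTorch's built-in dynamic quantization, a simpler variant that needs no calibration data, applied to the same checkpoint:

```python
import os
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "sguskin/dynamic-minilmv2-L6-H384-squad1.1"
)

# Convert all Linear layers to int8 weights; activations are quantized
# on the fly at inference time (dynamic quantization).
int8_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m, path="tmp_size_check.pt"):
    """Serialize the state dict and report its size on disk in MB."""
    torch.save(m.state_dict(), path)
    mb = os.path.getsize(path) / 1e6
    os.remove(path)
    return mb

print(f"fp32: {size_mb(model):.1f} MB -> int8: {size_mb(int8_model):.1f} MB")
```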
Evaluate the int8 model with a length configuration found by the search (both `--model_name_or_path` and `--output_dir` are used to load the int8 model):

```bash
python run_qa.py \
--model_name_or_path "sguskin/dynamic-minilmv2-L6-H384-squad1.1" \
--dataset_name squad \
--do_eval \
--accuracy_only \
--int8 \
--output_dir output/quantized-dynamic-minilmv \
--overwrite_cache \
--per_device_eval_batch_size 32 \
--length_config "(315, 251, 242, 159, 142, 33)"
```
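The `--length_config` tuple is the per-layer token budget found by the search: with `--max_seq_length 384` and six layers, layer 0 keeps 315 tokens, layer 1 keeps 251 of those, and so on down to 33. A small sketch of that bookkeeping:

```python
import ast

length_config = ast.literal_eval("(315, 251, 242, 159, 142, 33)")
max_seq_length = 384

kept = max_seq_length
for layer, budget in enumerate(length_config):
    print(f"layer {layer}: keep {budget} of {kept} tokens (drop {kept - budget})")
    kept = budget
```

Since self-attention and feed-forward cost grow with the number of tokens a layer processes, this schedule is what brings the GFLOPS in the table below from 4.76 down to 2.55.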
Performance results were measured on 07/10/2022 on an Intel Xeon Platinum 8280 Scalable processor with batch size = 32. Performance varies by use, configuration, and other factors; see the platform configuration below for details. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
| Model Name | Datatype | Optimization Method | Model Size (MB) | Accuracy (F1) | Latency (ms) | GFLOPS** | Speedup (compared with BERT Base) |
|---|---|---|---|---|---|---|---|
| BERT Base | fp32 | None | 415.47 | 88.58 | 56.56 | 35.3 | 1x |
| TinyBERT | fp32 | Distillation | 253.20 | 88.39 | 32.40 | 17.7 | 1.75x |
| QuaTinyBERT | int8 | Distillation + quantization | 132.06 | 87.67 | 15.58 | 17.7 | 3.63x |
| MiniLMv2 | fp32 | Distillation | 115.04 | 88.70 | 18.23 | 4.76 | 3.10x |
| QuaMiniLMv2 | int8 | Distillation + quantization | 84.85 | 88.54 | 9.14 | 4.76 | 6.18x |
| LA-MiniLM | fp32 | Drop and restore on base MiniLMv2 | 115.04 | 89.28 | 16.99 | 4.76 | 3.33x |
| LA-MiniLM (269, 253, 252, 202, 104, 34)* | fp32 | Evolutionary search (best config) | 115.04 | 87.76 | 11.44 | 2.49 | 4.94x |
| QuaLA-MiniLM | int8 | Quantization on base LA-MiniLM | 84.85 | 88.85 | 7.84 | 4.76 | 7.21x |
| QuaLA-MiniLM (315, 251, 242, 159, 142, 33)* | int8 | Evolutionary search (best config) | 84.86 | 87.68 | 6.41 | 2.55 | 8.82x |
NOTES: * The tuple gives the per-layer length configuration found by the evolutionary search. ** GFLOPS counts the multiply and add operations performed during model inference (obtained with the torchprofile tool).
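A sketch of how such numbers can be obtained with torchprofile, whose `profile_macs` reports multiply-accumulate operations (MACs) for one traced forward pass; whether MACs are reported as-is or doubled into FLOPs is a convention choice, and the input shapes below are assumptions:

```python
import torch
from torchprofile import profile_macs
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(
    "sguskin/minilmv2-L6-H384-squad1.1"
).eval()

# One full-length example: batch 1, 384 tokens (matching --max_seq_length).
input_ids = torch.randint(0, 30000, (1, 384))
attention_mask = torch.ones(1, 384, dtype=torch.long)

with torch.no_grad():
    macs = profile_macs(model, (input_ids, attention_mask))
print(f"{macs / 1e9:.2f} GMACs per example")
```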
Platform configuration:

| Manufacturer | Intel Corporation |
|---|---|
| Product Name | S2600WFD |
| BIOS Version | 1SE5C620.86B.02.01.0008.031920191559 |
| OS | CentOS Linux release 8.4.2105 |
| Kernel | 4.18.0-305.3.1.el8.x86_64 |
| Microcode | 0x5003006 |
| IRQ Balance | Enabled |
| CPU Model | Intel(R) Xeon Platinum 8280 CPU @ 2.70GHz |
| Base Frequency | 2.7GHz |
| Maximum Frequency | 4.0GHz |
| All-core Maximum Frequency | 3.3GHz |
| CPU(s) | 112 |
| Thread(s) per Core | 2 |
| Core(s) per Socket | 28 |
| Socket(s) | 2 |
| NUMA Node(s) | 2 |
| Turbo | Enabled |
| Frequency Governor | Performance |