Step-by-step

The Quantized Length Adaptive Transformer is based on the Length Adaptive Transformer work. It currently supports BERT- and RoBERTa-based transformers.

QuaLA-MiniLM: A Quantized Length-Adaptive MiniLM has been accepted at NeurIPS 2022. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches at any computational budget on the SQuAD1.1 dataset (up to 8.8x speedup with <1% accuracy loss). The following shows how to reproduce this work; Jupyter notebook tutorials are also provided.

Prerequisite

1. Environment

pip install intel-extension-for-transformers
pip install -r requirements.txt

Note: PyTorch 1.12.0 and Intel Extension for PyTorch 1.12.0 are recommended.

Note: A transformers version no higher than 4.34.1 is recommended.
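
To verify the environment matches these recommendations, a quick check (illustrative, not part of the original recipe) is:

```python
import torch
import transformers
import intel_extension_for_pytorch as ipex

# Recommended: PyTorch 1.12.0, Intel Extension for PyTorch 1.12.0, transformers <= 4.34.1
print("torch:", torch.__version__)
print("intel-extension-for-pytorch:", ipex.__version__)
print("transformers:", transformers.__version__)
```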

Run

Step 1: Finetune

This step produces output/finetuning, a MiniLM model fine-tuned on SQuAD, which has been uploaded to sguskin/minilmv2-L6-H384-squad1.1.

python run_qa.py \
--model_name_or_path nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large \
--dataset_name squad \
--do_train \
--do_eval \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 8 \
--output_dir output/finetuning
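
As a quick sanity check of the fine-tuned checkpoint (not part of the recipe itself), the Step 1 output directory can be loaded with the standard transformers question-answering pipeline; the question and context below are made up for illustration:

```python
from transformers import pipeline

# Load the Step 1 output (or the uploaded sguskin/minilmv2-L6-H384-squad1.1 checkpoint).
qa = pipeline("question-answering", model="output/finetuning")

result = qa(
    question="Which dataset is the model fine-tuned on?",
    context="The MiniLM model is fine-tuned on the SQuAD1.1 question answering dataset.",
)
print(result["answer"], result["score"])
```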

Step 2: Training with LengthDrop

Train the fine-tuned model with length-adaptive training to get the dynamic model output/dynamic, which has been uploaded to sguskin/dynamic-minilmv2-L6-H384-squad1.1.

python run_qa.py \
--model_name_or_path output/finetuning \
--dataset_name squad \
--do_train \
--do_eval \
--learning_rate 3e-5 \
--num_train_epochs 5 \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size 8 \
--length_adaptive \
--num_sandwich 2  \
--length_drop_ratio_bound 0.2 \
--layer_dropout_prob 0.2 \
--output_dir output/dynamic 
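
Conceptually, length-adaptive training applies LengthDrop: at each training step a monotonically non-increasing per-layer token budget is sampled, and the sandwich rule additionally trains the full-length model and a few randomly sampled sub-models with inplace distillation. The sampler below is only a sketch of that idea; the function name and exact sampling scheme are illustrative and not the code used by run_qa.py:

```python
import random

def sample_length_configuration(seq_len, num_layers, drop_ratio_bound=0.2):
    """Sketch: sample a monotonically non-increasing number of tokens per layer.

    At each layer, up to `drop_ratio_bound` of the remaining tokens are dropped,
    mirroring the --length_drop_ratio_bound flag above (illustrative only).
    """
    config = []
    remaining = seq_len
    for _ in range(num_layers):
        max_drop = int(remaining * drop_ratio_bound)
        remaining = remaining - random.randint(0, max_drop)
        config.append(remaining)
    return tuple(config)

# Example: a random length configuration for a 6-layer model and 384 input tokens.
print(sample_length_configuration(384, num_layers=6))
```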

Step 3: Evolutionary Search

Run evolutionary search to optimize length configurations for any possible target computational budget.

python run_qa.py \
--model_name_or_path output/dynamic \
--dataset_name squad \
--max_seq_length 384 \
--doc_stride 128 \
--do_eval \
--per_device_eval_batch_size 32 \
--do_search \
--output_dir output/search
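
Under the hood, the search evaluates candidate length configurations for accuracy and FLOPs and evolves them by mutation and crossover while keeping the Pareto-optimal front. The loop below is only an outline of that procedure; evaluate_f1_and_flops, mutate, and crossover are hypothetical callables, not functions exposed by run_qa.py:

```python
import random

def evolutionary_search(initial_configs, evaluate_f1_and_flops,
                        mutate, crossover, iterations=30, population_size=20):
    """Sketch: evolve length configurations toward the accuracy/FLOPs Pareto front."""
    scored = [(cfg, *evaluate_f1_and_flops(cfg)) for cfg in initial_configs]

    for _ in range(iterations):
        # Keep only Pareto-optimal configs: no other config is at least as accurate
        # while using no more FLOPs.
        pareto = [
            (cfg, f1, flops) for cfg, f1, flops in scored
            if not any(o_f1 >= f1 and o_flops <= flops and (o_f1, o_flops) != (f1, flops)
                       for _, o_f1, o_flops in scored)
        ]
        parents = [cfg for cfg, _, _ in pareto]

        # Refill the population with mutated and crossed-over offspring.
        children = []
        while len(children) < population_size:
            if len(parents) > 1 and random.random() < 0.5:
                children.append(crossover(*random.sample(parents, 2)))
            else:
                children.append(mutate(random.choice(parents)))
        scored = pareto + [(cfg, *evaluate_f1_and_flops(cfg)) for cfg in children]
    return scored
```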

Step 4: Quantization

python run_qa.py \
--model_name_or_path "sguskin/dynamic-minilmv2-L6-H384-squad1.1" \
--dataset_name squad \
--quantization_approach PostTrainingStatic \
--do_eval \
--do_train \
--tune \
--output_dir output/quantized-dynamic-minilmv \
--overwrite_cache \
--per_device_eval_batch_size 32 \
--overwrite_output_dir
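
Quantization here is driven by the --quantization_approach PostTrainingStatic flag through intel-extension-for-transformers. As a much simpler stand-in illustration of int8 quantization (dynamic rather than static, and not what run_qa.py actually does), stock PyTorch can quantize the Linear layers of the dynamic checkpoint, assuming it loads as a standard transformers QA model:

```python
import torch
from transformers import AutoModelForQuestionAnswering

# Stand-in illustration only: dynamic int8 quantization of Linear layers with stock PyTorch.
# The recipe above uses post-training *static* quantization via intel-extension-for-transformers.
model = AutoModelForQuestionAnswering.from_pretrained("sguskin/dynamic-minilmv2-L6-H384-squad1.1")
model.eval()
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(quantized)
```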

Step 5: Apply Length Config for Quantization

Note: --model_name_or_path and --output_dir are used to load the int8 model produced in Step 4.

python run_qa.py \
--model_name_or_path "sguskin/dynamic-minilmv2-L6-H384-squad1.1" \
--dataset_name squad \
--do_eval \
--accuracy_only \
--int8 \
--output_dir output/quantized-dynamic-minilmv \
--overwrite_cache \
--per_device_eval_batch_size 32 \
--length_config "(315, 251, 242, 159, 142, 33)"
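
The --length_config tuple has one entry per layer of the 6-layer MiniLM and gives the number of tokens kept at that layer; the values here are the best configuration found by the search in Step 3 for the quantized model. A small sketch of how to read it (the 384 denominator is the --max_seq_length used above; how run_qa.py parses the flag internally may differ):

```python
from ast import literal_eval

# One entry per transformer layer: tokens kept at that layer.
length_config = literal_eval("(315, 251, 242, 159, 142, 33)")
for layer, kept in enumerate(length_config, start=1):
    print(f"layer {layer}: keep {kept}/384 tokens ({kept / 384:.0%})")
```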

Performance Data

Performance results were tested on 07/10/2022 with an Intel Xeon Platinum 8280 Scalable processor, batch size = 32. Performance varies by use, configuration, and other factors. See the Platform Configuration section below for configuration details. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.


| Model Name | Datatype | Optimization Method | Model Size (MB) | Accuracy (F1) | Latency (ms) | GFLOPS** | Speedup (vs. BERT Base) |
|---|---|---|---|---|---|---|---|
| BERT Base | fp32 | None | 415.47 | 88.58 | 56.56 | 35.3 | 1x |
| TinyBERT | fp32 | Distillation | 253.20 | 88.39 | 32.40 | 17.7 | 1.75x |
| QuaTinyBERT | int8 | Distillation + quantization | 132.06 | 87.67 | 15.58 | 17.7 | 3.63x |
| MiniLMv2 | fp32 | Distillation | 115.04 | 88.70 | 18.23 | 4.76 | 3.10x |
| QuaMiniLMv2 | int8 | Distillation + quantization | 84.85 | 88.54 | 9.14 | 4.76 | 6.18x |
| LA-MiniLM | fp32 | Drop and restore, based on MiniLMv2 | 115.04 | 89.28 | 16.99 | 4.76 | 3.33x |
| LA-MiniLM (269, 253, 252, 202, 104, 34)* | fp32 | Evolution search (best config) | 115.04 | 87.76 | 11.44 | 2.49 | 4.94x |
| QuaLA-MiniLM | int8 | Quantization, based on LA-MiniLM | 84.85 | 88.85 | 7.84 | 4.76 | 7.21x |
| QuaLA-MiniLM (315, 251, 242, 159, 142, 33)* | int8 | Evolution search (best config) | 84.86 | 87.68 | 6.41 | 2.55 | 8.82x |
NOTES: * The length config is applied to the LA model.

NOTES: ** The amount of multiply and add operations during model inference (GFLOPS is obtained with the torchprofile tool).
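
The GFLOPS column can be reproduced approximately with torchprofile; the sketch below counts MACs for a single 384-token sequence on the fp32 MiniLM checkpoint (torchprofile reports MACs, so the exact conversion to the table's GFLOPS values is not shown here):

```python
import torch
from torchprofile import profile_macs
from transformers import AutoModelForQuestionAnswering

# torchscript=True makes the model return tuples, which keeps tracing simple.
model = AutoModelForQuestionAnswering.from_pretrained(
    "sguskin/minilmv2-L6-H384-squad1.1", torchscript=True
)
model.eval()

# One dummy 384-token sequence, matching --max_seq_length in the steps above.
input_ids = torch.zeros(1, 384, dtype=torch.long)
attention_mask = torch.ones(1, 384, dtype=torch.long)

macs = profile_macs(model, (input_ids, attention_mask))
print(f"{macs / 1e9:.2f} GMACs per sequence")
```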

Platform Configuration

| Item | Value |
|---|---|
| Manufacturer | Intel Corporation |
| Product Name | S2600WFD |
| BIOS Version | 1SE5C620.86B.02.01.0008.031920191559 |
| OS | CentOS Linux release 8.4.2105 |
| Kernel | 4.18.0-305.3.1.el8.x86_64 |
| Microcode | 0x5003006 |
| IRQ Balance | Enabled |
| CPU Model | Intel(R) Xeon Platinum 8280 CPU @ 2.70GHz |
| Base Frequency | 2.7GHz |
| Maximum Frequency | 4.0GHz |
| All-core Maximum Frequency | 3.3GHz |
| CPU(s) | 112 |
| Thread(s) per Core | 2 |
| Core(s) per Socket | 28 |
| Socket(s) | 2 |
| NUMA Node(s) | 2 |
| Turbo | Enabled |
| Frequency Governor | Performance |