Creating Quantized Models
The point of quantizing models is to allow faster (approximate) computation through integer arithmetic for general matrix multiplication, a.k.a. IntGEMM. We distinguish IntGEMM16 (16-bit quantization) and IntGEMM8 (8-bit quantization), I16 and I8 for short. Quantization is particularly relevant for fast(er) inference on CPUs. The starting point in both cases is a trained model with 32-bit float (F32) parameters.
I16 leads only to small losses in translation quality; fine-tuning of the quantized model is normally not required. To quantize an F32 model into I16, run
TO BE ADDED!
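The exact command is still to be added here. As a hedged sketch, assuming the `marian-conv` tool that ships with Marian and its `--gemm-type` option (flag names may differ in your build), the conversion could look roughly like this:

```bash
# Sketch only: convert a trained F32 model to 16-bit IntGEMM format.
# Tool name and flags (--from/--to/--gemm-type) are assumptions based on
# the usual marian-conv interface; check marian-conv --help for your build.
./marian-conv \
    --from model.npz \
    --to model.intgemm16.bin \
    --gemm-type intgemm16
```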
8-bit quantization without fine-tuning of the quantized model may lead to a loss in translation quality. It is therefore recommended to fine-tune the model. To do so, we
TO BE ADDED!
TO BE ADDED!
TO BE ADDED!
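The fine-tuning recipe is still to be added. As a hedged sketch, assuming this Marian build supports quantization-aware training via a `--quantize-bits` option and the same `marian-conv` conversion as above (both assumptions, not confirmed by this page), the workflow might look roughly like this:

```bash
# Sketch only: briefly continue training the F32 model with 8-bit
# quantization-aware training, then convert it to IntGEMM8.
# --quantize-bits and the conversion flags are assumptions.

# 1) Fine-tune: resume from the trained F32 model, quantizing parameters
#    to 8 bits during training so the model adapts to the reduced precision.
./marian \
    --model model.npz \
    --train-sets corpus.src corpus.trg \
    --vocabs vocab.spm vocab.spm \
    --quantize-bits 8 \
    --learn-rate 0.0001 \
    --after-epochs 1

# 2) Convert the fine-tuned model to the 8-bit IntGEMM format.
./marian-conv \
    --from model.npz \
    --to model.intgemm8.bin \
    --gemm-type intgemm8
```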
Memory footprint increases nearly linearly with the number of threads.
| Mini-batch 32, shortlist | Mini-batch 32, no shortlist | Mini-batch 1, shortlist | Mini-batch 1, no shortlist |
|---|---|---|---|
| 227 MB | 181 MB | 177 MB | 132 MB |

| Mini-batch 32, shortlist | Mini-batch 32, no shortlist | Mini-batch 1, shortlist | Mini-batch 1, no shortlist |
|---|---|---|---|
| 423 MB | 377 MB | 287 MB | 242 MB |
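Peak memory for a given decoder configuration can be checked with GNU time; the sketch below only illustrates the settings the table columns refer to (mini-batch size and lexical shortlist), with placeholder model, vocabulary, and shortlist files:

```bash
# Sketch only: measure peak resident memory for one decoder configuration.
# File names are placeholders; shortlist parameters follow the usual
# "path first best" convention of marian-decoder.
/usr/bin/time -v ./marian-decoder \
    --models model.intgemm8.bin \
    --vocabs vocab.spm vocab.spm \
    --mini-batch 32 \
    --shortlist lex.s2t.gz 50 50 \
    < input.txt > output.txt
# "Maximum resident set size" in the report is the figure of interest.
```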