Intel® Neural Compressor v2.4 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
Highlights
- Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization (see the sketch after this list).
- Supported Weight-Only Quantization tuning for ONNX Runtime backend.
- Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with the FW extension API.
- Supported SmoothQuant for big SavedModels on the TensorFlow backend.
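To make the layer-wise direction concrete, here is a minimal sketch of PyTorch layer-wise RTN Weight-Only Quantization with the 2.x `PostTrainingQuantConfig` API. The toy model is a stand-in for a real checkpoint, and the exact knob names (the `layer_wise_quant` recipe key and the `op_type_dict` weight fields) are assumptions to verify against the layer-wise quantization docs for your installed version.

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization

# Toy fp32 module standing in for a real LLM checkpoint.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
)

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # regex matched against op types
            "weight": {
                "bits": 4,
                "group_size": 32,
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
    # Assumption: the layer-wise path is switched on via this recipe key,
    # so weights are loaded and quantized one layer at a time to cap memory.
    recipes={"layer_wise_quant": True},
)

# RTN needs no calibration data, so fit() takes only the model and config.
q_model = quantization.fit(model, conf)
```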
Features
- [Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
- [Quantization] Support Weight-Only Quantization tuning for ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
- [Quantization] Support SmoothQuant block-wise alpha-tuning (ee6bc2) (see the sketch after this list)
- [Quantization] Support SmoothQuant for big SavedModels on the TensorFlow backend (3b2925, 4f2c35)
- [Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
- [Quantization] Support PyTorch layer-wise quantization for RTN (ebd1e2)
- [Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
- [Common] [Experimental] Implement the FW extension API (76b8b3, 8447d7, 258236)
- [Quantization] [Experimental] Support Weight-Only Quantization via the FW extension API for the PyTorch backend (915018, dc9328)
- [Quantization] [Experimental] Support Keras quantization via the FW extension API for the TensorFlow backend (2627d3)
- [Quantization] Support IPEX 2.1 XPU (CPU+GPU) (af0b50, cf847c)
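As a companion to the block-wise alpha-tuning item above, the sketch below shows SmoothQuant with automatic alpha search on the PyTorch backend. The dummy dataset and the `auto_alpha_args`/`do_blockwise` knob are assumptions based on the SmoothQuant recipe arguments documented for the 2.x API; verify the argument names against your installed version.

```python
import torch
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.data import DataLoader, Datasets

# Toy fp32 model plus a dummy calibration loader built from INC's own data API.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
)
dataset = Datasets("pytorch")["dummy"](shape=(8, 64))
calib_loader = DataLoader(framework="pytorch", dataset=dataset)

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {
            "alpha": "auto",  # search for alpha instead of fixing a scalar
            # Assumption: block-wise alpha-tuning is enabled through this knob.
            "auto_alpha_args": {"do_blockwise": True},
        },
    },
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_loader)
```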
Improvement
- [Quantization] Add a use_optimum_format option to export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
- [Quantization] Enhance ONNX Runtime quantization with DirectML EP (db0fef, d13183, 098401, 6cad50)
- [Quantization] Support restoring IPEX models from JSON (c3214c)
- [Quantization] Add attributes to MatMulNBits for ONNX Runtime (7057e3)
- [Quantization] Speed up SmoothQuant auto-alpha tuning (173c18)
- [Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
- [Quantization] Add SmoothQuant weight_clipping as a default_on option (1f4aec)
- [Quantization] Support SmoothQuant with MinMaxObserver (45b496)
- [Quantization] Support Weight-Only Quantization with fp16 for PyTorch backend (d5cb56)
- [Quantization] Support tracing with dictionary-type example_inputs (afe315)
- [Quantization] Support Falcon Weight-Only Quantization (595d3a)
- [Common] Add a deprecation decorator to the experimental folder (aeb3ed)
- [Common] Remove 1.x API dependency (ee617a)
- [Mixed Precision] Support PyTorch eager-mode BF16 MixedPrecision (3bfb76) (see the sketch after this list)
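For the eager-mode BF16 item above, here is a minimal sketch using the documented `MixedPrecisionConfig` path. The toy model is illustrative, and actual op coverage depends on your hardware and framework support.

```python
import torch
from neural_compressor import mix_precision
from neural_compressor.config import MixedPrecisionConfig

# Toy eager-mode fp32 model.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())

conf = MixedPrecisionConfig()  # defaults to BF16 conversion on supported ops
converted_model = mix_precision.fit(model, conf=conf)
```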
Productivity
- Support quantization and benchmarking on macOS (16d6a0)
- Support ONNX Runtime 1.16.0 (d81732, 299af9, 753783)
- Support TensorFlow new API for gnr-base (8160c7)
Bug Fixes
- Fix the "GraphModule object has no attribute bias" error (7f53d1)
- Fix ONNX model export issue (af0aea, eaa57f)
- Add clipping for ONNX Runtime SmoothQuant (cbb69b)
- Fix SmoothQuant MinMaxObserver initialization (b1db1c)
- Fix SmoothQuant issue in get/set_module (dffcfe)
- Align sparsity with block-wise masks in progressive pruning (fcdc29)
Examples
- Support PEFT models with SmoothQuant (5e21b7)
- Enable two ONNX Runtime examples: table-transformer-detection (550cee) and BEiT (7265df)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Windows 10 & macOS Ventura 13.5
- Python 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.13, 2.14, 2.15
- ITEX 1.2.0, 2.13.0.0, 2.14.0.1
- PyTorch/IPEX 1.13.0+cpu, 2.0.1+cpu, 2.1.0
- ONNX Runtime 1.14.1, 1.15.1, 1.16.3
- MXNet 1.9.1