Release Intel® Neural Compressor v2.4 Release · intel/neural-compressor

Highlights

Supported layer-wise quantization for PyTorch RTN/GPTQ Weight-Only Quantization and ONNX Runtime W8A8 quantization.
Supported Weight-Only Quantization tuning for ONNX Runtime backend.
Supported GGML double quant on RTN/GPTQ Weight-Only Quantization with FW extension API
Supported SmoothQuant of Big Saved Model for TensorFlow Backend.

Features

[Quantization] Support GGML double quant in Weight-Only Quantization for RTN and GPTQ (05c15a)
[Quantization] Support Weight-Only Quantization tuning for ONNX Runtime backend (6d4ea5, 934ba0, 4fcfdf)
[Quantization] Support SmoothQuant block-wise alpha-tuning (ee6bc2)
[Quantization] Support SmoothQuant of Big Saved Model for TensorFlow Backend (3b2925, 4f2c35)
[Quantization] Support PyTorch layer-wise quantization for GPTQ (ee5450)
[Quantization] support PyTorch layer-wise quantization for RTN (ebd1e2)
[Quantization] Support ONNX Runtime layer-wise W8A8 quantization (6142e4, 5d33a5)
[Common] [Experimental] FW extension API implement (76b8b3, 8447d7, 258236)
[Quantization] [Experimental] FW extension API for PT backend support Weight-Only Quantization (915018, dc9328)
[Quantization] [Experimental] FW extension API for TF backend support Keras Quantization (2627d3)
[Quantization] IPEX 2.1 XPU (CPU+GPU) support (af0b50, cf847c)

Improvement

[Quantization] Add use_optimum_format for export_compressed_model in Weight-Only Quantization (5179da, 0a0644)
[Quantization] Enhance ONNX Runtime quantization with DirectML EP (db0fef, d13183, 098401, 6cad50)
[Quantization] Support restore ipex model from json (c3214c)
[Quantization] ONNX Runtime add attr to MatMulNBits (7057e3)
[Quantization] Increase SmoothQuant auto alpha running speed (173c18)
[Quantization] Add SmoothQuant alpha search space as a config argument (f9663d)
[Quantization] Add SmoothQuant weight_clipping as a default_on option (1f4aec)
[Quantization] Support SmoothQuant with MinMaxObserver (45b496)
[Quantization] Support Weight-Only Quantization with fp16 for PyTorch backend (d5cb56)
[Quantization] Support trace with dictionary type example_inputs (afe315)
[Quantization] Support falcon Weight-Only Quantization (595d3a)
[Common] Add deprecation decorator in experimental fold (aeb3ed)
[Common] Remove 1.x API dependency (ee617a)
[Mixed Precision] Support PyTorch eager mode BF16 MixedPrecision (3bfb76)

Productivity

Bug Fixes

Examples

Support peft model with SmoothQuant (5e21b7)
Enable two ONNX Runtime examples table-transformer-detection (550cee), BEiT (7265df)

Validated Configurations

Provide feedback