Intel® Neural Compressor v2.3 Release
- Highlights
- Features
- Improvement
- Productivity
- Bug Fixes
- Examples
- Validated Configurations
Highlights
- Integrate Intel Neural Compressor into MSFT ONNX Runtime (#16288) and Olive (#411, #412, #469).
- Supported low-precision data types (INT4, NF4, FP4) and Weight-Only Quantization algorithms including RTN, AWQ, GPTQ and TEQ on ONNX Runtime and PyTorch for LLM optimization.
- Supported SparseGPT pruner (88adfc).
- Supported quantization for ONNX Runtime DML EP and DNNL EP, and verified inference on Intel NPU (e.g., Meteor Lake) and Intel CPU (e.g., Sapphire Rapids).
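As a rough illustration of the NF4 data type highlighted above: NF4 stores each weight as a 4-bit index into a fixed 16-entry code book of normal-distribution quantiles, plus a per-block scale. The sketch below is illustrative only, assuming approximate (rounded) code-book values and hypothetical helper names, not the library's actual implementation:

```python
import numpy as np

# Approximate NF4 code book: 16 quantiles of a standard normal distribution,
# normalized to [-1, 1] (values rounded here for illustration).
NF4_LEVELS = np.array([
    -1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
     0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230, 1.0000,
])

def nf4_quantize_block(block):
    """Quantize a 1-D float block to NF4: scale by the block's abs-max,
    then snap each value to the nearest code-book level."""
    scale = float(np.abs(block).max()) or 1.0   # guard an all-zero block
    normalized = block / scale                  # now within [-1, 1]
    idx = np.abs(normalized[:, None] - NF4_LEVELS).argmin(axis=1)
    return idx.astype(np.uint8), scale          # 4-bit codes + one fp scale

def nf4_dequantize(idx, scale):
    return NF4_LEVELS[idx] * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
codes, scale = nf4_quantize_block(w)
w_hat = nf4_dequantize(codes, scale)
```

Because the levels are denser near zero, NF4 spends its 16 values where normally distributed weights actually concentrate, which is why it tends to beat uniform INT4 at the same bit width.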
Features
- [Quantization] Support ONNX Runtime quantization and inference for DNNL EP (79be8b)
- [Quantization] [Experimental] Support ONNX Runtime quantization and inference for DirectML EP (750bb9)
- [Quantization] Support low precision and Weight-Only Quantization (WOQ) algorithms, including RTN (501440, 19ab16, 859315), AWQ (2562f2, 641d42), GPTQ (b5ac3c, 6ba783) and TEQ (d2f995, 9ff7f0) for PyTorch
- [Quantization] Support NF4 and FP4 data types for PyTorch Weight-Only Quantization (3d11b5)
- [Quantization] Support low precision and Weight-Only Quantization algorithms, including RTN, AWQ and GPTQ for ONNX Runtime (da4c92)
- [Quantization] Support layer-wise quantization (d9d1fc) and enable with SmoothQuant (ec9ae9)
- [Pruning] Add SparseGPT pruner and refactor pruning class (88adfc)
- [Pruning] Add Hyper-parameter Optimization algorithm for pruning (6613cf)
- [Model Export] Support PT2ONNX dynamic quantization export (165532)
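Of the four weight-only algorithms listed above, RTN (round-to-nearest) is the simplest: weights are rounded to a symmetric integer grid per group of channels, with no calibration data. A minimal NumPy sketch, where the function name, group size, and symmetric scheme are illustrative assumptions rather than the library's API:

```python
import numpy as np

def rtn_quantize(weight, bits=4, group_size=32):
    """Round-to-nearest weight-only quantization (symmetric, per-group).
    weight: 2-D array (out_features, in_features); in_features must be
    divisible by group_size. Returns the dequantized (fake-quant) weight."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero groups
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).reshape(out_f, in_f)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)).astype(np.float32)
w_q = rtn_quantize(w)
err = np.abs(w - w_q).max()
```

AWQ, GPTQ and TEQ start from the same grid but use calibration activations to reduce the output error that plain rounding leaves behind.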
Improvement
- [Common] Clean up dataloader usage in examples (1044d8, a2931e, 447cc7)
- [Common] Enhance ONNX Runtime backend check (4ce9de)
- [Strategy] Add block-wise distributed fallback in basic strategy (ea309f)
- [Strategy] Enhance strategy exit policy (d19b42)
- [Quantization] Add WeightOnlyLinear for Weight-Only approach to allow low memory inference (00bbf8)
- [Quantization] Support more ONNX Runtime direct INT8 ops (b9ce61)
- [Quantization] Support TensorFlow per-channel MatMul quantization (cf5589)
- [Quantization] Implement a new method to perform alpha auto-tuning in SmoothQuant (084eda)
- [Quantization] Enhance ONNX SmoothQuant tuning structure (f0d51c)
- [Quantization] Enhance PyTorch SmoothQuant tuning structure (81da40)
- [Quantization] Update PyTorch examples dataloader to support transformers 4.31.x (59371f)
- [Quantization] Enhance ONNX Runtime backend setting for GPU EP support (295535)
- [Pruning] Refactor pruning (92d14d)
- [Mixed Precision] Update the list of supported layers for Keras mixed precision (692c8b)
- [Mixed Precision] Introduce quant_level into mixed precision (0dc6a9)
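SmoothQuant, which several of the entries above tune, migrates quantization difficulty from activations to weights via per-input-channel scales s_j = max|X_j|^α / max|W_j|^(1-α), leaving the matmul result unchanged. A NumPy sketch of the scale computation with a fixed α = 0.5 (the library auto-tunes α; the function name and epsilon guards are illustrative):

```python
import numpy as np

def smoothquant_scales(x, w, alpha=0.5):
    """Per-input-channel smoothing scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha).
    x: calibration activations (tokens, in_features)
    w: weights (in_features, out_features)"""
    x_absmax = np.abs(x).max(axis=0)
    w_absmax = np.abs(w).max(axis=1)
    s = x_absmax ** alpha / np.maximum(w_absmax, 1e-8) ** (1 - alpha)
    return np.maximum(s, 1e-8)                      # guard all-zero channels

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
w = rng.standard_normal((32, 8))
s = smoothquant_scales(x, w)

# Mathematically equivalent: X @ W == (X / s) @ (diag(s) @ W)
y_ref = x @ w
y_smooth = (x / s) @ (w * s[:, None])
```

Dividing activation outliers by s flattens their dynamic range, while the compensating multiply is folded into the weights offline, so only the easier-to-quantize tensors reach the INT8 kernels.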
Productivity
- [Ecosystem] MSFT Olive integrates SmoothQuant and 3 LLM examples (#411, #412, #469)
- [Ecosystem] MSFT ONNX Runtime integrates SmoothQuant static quantization (#16288)
- [Neural Insights] Support PyTorch FX tensor inspection and integrate it with Neural Insights (775def, 74a785)
- [Neural Insights] Add step-by-step diagnosis cases (99c3b0)
- [Neural Solution] Resource management and user-facing API enhancement (fbba10)
- [Auto CI] Integrate auto CI code scan bug fix tools (f77a2c, 06cc38)
Bug Fixes
- Fix bugs in PyTorch SmoothQuant (0349b9, 8f3645)
- Fix PyTorch dataloader batch size issue (6a98d0)
- Fix bugs for ONNX Runtime CUDA EP (a1b566, d1f315)
- Fix bug in ONNX Runtime adapter where _rename_node function fails with model size > 2 GB (1f6b1a)
- Fix ONNX Runtime diagnosis bug (f10e26)
- Update Neural Solution example and fix gRPC port issue (528868)
- Fix the objective initialization issue (9d7546)
- Fix reshape issue for Bayesian strategy (77cb83)
- Fix CVEs (d86922, 2bbfcd, fc71fa)
Examples
- Add Weight-Only LLM examples for PyTorch (4b24be, 66f7c1, aa457a)
- Add Weight-Only LLM examples for ONNX Runtime (10c133)
- Enable 3 ONNX Runtime examples: CodeBERT (5e584e), LayoutLMv2 FUNSD (5f0b17), Table Transformer (eb8a95)
- Add ONNX Runtime LLM SmoothQuant example Llama-7B (7fbcf5)
- Enable 2 TensorFlow examples: ViT (94df99), GraphSAGE (29ec82)
- Add easy get-started notebooks (d7b608, 6ee846)
- Add multi-cards magnitude pruning use case (909618)
- Unify ONNX Runtime prepare model scripts (5ecb13)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04
- Python 3.7, 3.8, 3.9, 3.10, 3.11
- TensorFlow 2.11, 2.12, 2.13
- ITEX 1.1.0, 1.2.0, 2.13.0.0
- PyTorch/IPEX 1.12.1+cpu, 1.13.0+cpu, 2.0.1+cpu
- ONNX Runtime 1.13.1, 1.14.1, 1.15.1
- MXNet 1.9.1