Releases: ModelCloud/GPTQModel
GPTQModel v1.0.6
What's Changed
Patch release fixing the loading of quantized Llama 3.2 Vision models.
- [FIX] mllama loader by @LRL-ModelCloud in #404
Full Changelog: v1.0.5...v1.0.6
GPTQModel v1.0.5
What's Changed
Added partial quantization support for the Llama 3.2 Vision model. v1.0.5 allows quantization of the text layers (the layers responsible for text generation) only; vision layer support will be added shortly. A Llama 3.2 11B Vision Instruct model will quantize to ~50% of its original size in 4-bit mode. Once vision layer support lands, the size will drop to the expected ~1/4.
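For reference, a minimal sketch of the text-layer-only flow. Method names follow the GPTQModel API of this release line, and the calibration text is a placeholder, not a recommended dataset:

```python
# Minimal sketch: 4-bit quantization of the Llama 3.2 Vision text layers.
# The calibration text below is a placeholder, not a recommended dataset.
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenized calibration samples (placeholder text, repeated).
calibration = [
    tokenizer("GPTQModel quantizes large language models with minimal accuracy loss.")
] * 256

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.from_pretrained(model_id, quant_config)

# v1.0.5 quantizes only the text-generation layers; vision layers stay at
# original precision, hence the ~50% (not ~75%) size reduction.
model.quantize(calibration)
model.save_quantized("Llama-3.2-11B-Vision-Instruct-gptq-4bit")
```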
- [MODEL] Add Llama 3.2 Vision (mllama)* support by @LRL-ModelCloud in #401
Full Changelog: v1.0.4...v1.0.5
GPTQModel v1.0.4
What's Changed
Liger Kernel support added for a ~50% VRAM reduction during the quantization stage of some models. Added a toggle to disable parallel packing to avoid OOM on larger models. Transformers dependency updated to 4.45.0 for Llama 3.2 support.
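As a sketch only: the flag spellings below (parallel_packing, use_liger_kernel) are taken from the PR titles and are assumptions rather than verified API; check the v1.0.4 docs for the exact names.

```python
# Sketch only: flag names mirror the PR titles and may not match the
# shipped API exactly; consult the v1.0.4 docs for exact spellings.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    parallel_packing=False,  # assumed toggle: serialize packing to lower peak memory
)

model = GPTQModel.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",  # placeholder model id
    quant_config,
    use_liger_kernel=True,  # assumed flag: enable Liger kernels (~50% less quant VRAM)
)
```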
- [FEATURE] add a parallel_packing toggle by @LRL-ModelCloud in #393
- [FEATURE] add liger_kernel support by @LRL-ModelCloud in #394
Full Changelog: v1.0.3...v1.0.4
GPTQModel v1.0.3
What's Changed
- [MODEL] Add minicpm3 by @LDLINGLINGLING in #385
- [FIX] fix minicpm3 support by @LRL-ModelCloud in #387
- [MODEL] Added GRIN-MoE support by @LRL-ModelCloud in #388
New Contributors
- @LDLINGLINGLING made their first contribution in #385
- @mrT23 made their first contribution in #386
Full Changelog: v1.0.2...v1.0.3
GPTQModel v1.0.2
What's Changed
Upgraded the AutoRound package to v0.3.0. Pre-built WHL and PyPI source releases are now available. Install by downloading our pre-built WHL or with pip install gptqmodel --no-build-isolation.
- [CORE] Autoround v0.3 by @LRL-ModelCloud in #368
- [CI] Lots of CI fixups by @CSY-ModelCloud
Full Changelog: v1.0.0...v1.0.2
GPTQModel v1.0.0
What's Changed
40% faster multi-threaded packing, new lm_eval api, and fixed Python 3.9 compatibility.
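The new api presumably wraps an lm-evaluation-harness flow; a rough hand-rolled equivalent calls the harness directly. The checkpoint id and task below are placeholders:

```python
# Rough equivalent of the new lm_eval api, calling the
# lm-evaluation-harness directly; checkpoint id and task are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face backend; loads GPTQ checkpoints via transformers
    model_args="pretrained=ModelCloud/some-gptq-4bit-model",  # hypothetical checkpoint
    tasks=["arc_challenge"],
    num_fewshot=0,
)
print(results["results"])
```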
- Add lm_eval api by @PZS-ModelCloud in #338
- Multi-threaded packing in quantization by @PZS-ModelCloud in #354
- [CI] Add TGI unit test by @PZS-ModelCloud in #348
- [CI] Updates by @CSY-ModelCloud in #347, #352, #353, #355, #357
- Fix python 3.9 compat by @PZS-ModelCloud in #358
Full Changelog: v0.9.11...v1.0.0
GPTQModel v0.9.11
What's Changed
Added LG EXAONE 3.0 model support. New flexible dynamic per-layer/module quantization where each layer/module may use different bits/params. Added proper sharding support to backend.BITBLAS. Auto-heal quantization errors caused by overly small damp values.
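For illustration, a hedged sketch of a per-module override via the new dynamic config; the regex key and the override fields below are examples only, and the exact match syntax may differ:

```python
# Illustrative sketch of the new dynamic per-layer/module config; the
# regex key and override fields are examples, not an exhaustive reference.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # default for all quantized modules
    group_size=128,
    dynamic={
        # modules whose names match the regex get their own bits/params
        r".*\.down_proj.*": {"bits": 8, "group_size": 64},
    },
)
```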
- [CORE] add support for pack and shard to bitblas by @LRL-ModelCloud in #316
- Add dynamic bits by @PZS-ModelCloud in #311, #319, #321, #323, #327
- [MISC] Adjust the validate order of QuantLinear when BACKEND is AUTO by @ZX-ModelCloud in #318
- add save_quantized log model total size by @PZS-ModelCloud in #320
- Auto damp recovery by @CSY-ModelCloud in #326
- [FIX] add missing original_infeatures by @CSY-ModelCloud in #337
- Update Transformers to 4.44.0 by @Qubitium in #336
- [MODEL] add exaone model support by @LRL-ModelCloud in #340
- [CI] Upload wheel to local server by @CSY-ModelCloud in #339
- [MISC] Fix assert by @CSY-ModelCloud in #342
Full Changelog: v0.9.10...v0.9.11
GPTQModel v0.9.10
What's Changed
Ported the vllm/nm gptq_marlin inference kernel with expanded bits (8-bit), group_size (64, 32), and desc_act support for all GPTQ models with format = FORMAT.GPTQ. Auto-calculate auto-round nsamples/seqlen parameters based on the calibration dataset. Fixed save_quantized() when called on pre-quantized models with unsupported backends. HF Transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both the quant and inference stages.
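For example, the ported kernel can be selected explicitly when loading a pre-quantized checkpoint; the checkpoint id below is a placeholder:

```python
# Sketch: load a pre-quantized FORMAT.GPTQ checkpoint on the ported
# gptq_marlin kernel. The checkpoint id is a placeholder.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.from_quantized(
    "ModelCloud/some-gptq-4bit-model",  # hypothetical pre-quantized checkpoint
    backend=BACKEND.MARLIN,             # 4/8-bit, group_size 32/64, desc_act supported
)
```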
- [CORE] add marlin inference kernel by @ZX-ModelCloud in #310
- [CI] Increase timeout to 40m by @CSY-ModelCloud in #295, #299
- [FIX] save_quantized() by @ZX-ModelCloud in #296
- [FIX] autoround nsample/seqlen to be actual size of calibration_dataset. by @LRL-ModelCloud in #297, @LRL-ModelCloud in #298
- Update HF transformers to 4.43.3 by @Qubitium in #305
- [CI] remove test_marlin_hf_cache_serialization() by @ZX-ModelCloud in #314
Full Changelog: v0.9.9...v0.9.10
GPTQModel v0.9.9
What's Changed
Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang.
- [CI] by @CSY-ModelCloud in #238, #236, #237, #241, #242, #243, #246, #247, #250
- [FIX] explicitly call torch.no_grad() by @LRL-ModelCloud in #239
- Bitblas update by @Qubitium in #249
- [FIX] calib avg for calib dataset arg passed as tensors by @Qubitium, @LRL-ModelCloud in #254, #258
- [MODEL] gemma2 27b can load with vLLM now by @LRL-ModelCloud in #257
- [OPTIMIZE] to optimize vllm inference, set an environment variable 'VLLM_ATTENTI… by @LRL-ModelCloud in #260
- [FIX] hard set batch_size to 1 for 4.43.0 transformer due to compat/regression by @LRL-ModelCloud in #279
- FIX vllm llama 3.1 support by @Qubitium in #280
- Use better defaults values for quantization config by @Qubitium in #281
- [REFACTOR] Cleanup backend and model_type usage by @LRL-ModelCloud in #276
- [FIX] allow auto_round lm_head quantization by @LRL-ModelCloud in #282
- [FIX] [MODEL] Llama-3.1-8B-Instruct's eos_token_id is a list by @CSY-ModelCloud in #284
- [FIX] add release_vllm_model, and import destroy_model_parallel in release_vllm_model by @LRL-ModelCloud in #288
- [FIX] autoround quants compat with vllm/sglang by @Qubitium in #287
Full Changelog: v0.9.8...v0.9.9
GPTQModel v0.9.8
What's Changed
- Marlin end-to-end in/out feature padding for max model support
- Run quantized models (FORMAT.GPTQ) directly using the fast vLLM backend!
- Run quantized models (FORMAT.GPTQ) directly using the fast SGLang backend! (see the sketch below)
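A hedged sketch of the vLLM path; the checkpoint id is a placeholder, and the generate() kwargs are assumed to pass through to vLLM's engine:

```python
# Sketch: run a FORMAT.GPTQ checkpoint through the new vLLM backend.
# Checkpoint id is a placeholder; generate() kwargs are assumed to be
# forwarded to vLLM.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.from_quantized(
    "ModelCloud/some-gptq-4bit-model",  # hypothetical quantized checkpoint
    backend=BACKEND.VLLM,
)
output = model.generate(prompts="The capital of France is")
```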
- 🚀 🚀 [CORE] Marlin end-to-end in/out feature padding by @LRL-ModelCloud in #183 #192
- 🚀 🚀 [CORE] Add vLLM Backend for FORMAT.GPTQ by @PZS-ModelCloud in #190
- 🚀 🚀 [CORE] Add SGLang Backend by @PZS-ModelCloud in #191
- 🚀 [CORE] Use Triton v2 to pack gptq/gptqv2 formats by @LRL-ModelCloud in #202
- ✨ [CLEANUP] remove triton warmup by @Qubitium in #200
- 👾 [FIX] 8bit choosing wrong packer by @Qubitium in #199
- ✨ [CI] [CLEANUP] Improve Unit Tests by CSY, PSY, and ZYC
- ✨ [DOC] Consolidate Examples by ZYC in #225
Full Changelog: v0.9.7...v0.9.8