Releases: ModelCloud/GPTQModel
GPTQModel v1.0.6
What's Changed
Patch release fixing the loading of quantized Llama 3.2 Vision models.
- [FIX] mllama loader by @LRL-ModelCloud in #404
Full Changelog: v1.0.5...v1.0.6
GPTQModel v1.0.5
What's Changed
Added partial quantization support for the Llama 3.2 Vision model. v1.0.5 allows quantization of the text layers (the layers responsible for text generation) only; vision layer support will be added shortly. A Llama 3.2 11B Vision Instruct model will quantize to ~50% of its original size in 4-bit mode. Once vision layer support lands, the size will drop to the expected ~1/4.
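For reference, a minimal sketch of the text-layer-only flow. Method names follow the GPTQModel API of this release line, and the calibration text is a placeholder, not a recommended dataset:

```python
# Minimal sketch: 4-bit quantization of the Llama 3.2 Vision text layers.
# The calibration text below is a placeholder, not a recommended dataset.
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenized calibration samples (placeholder text, repeated).
calibration = [
    tokenizer("GPTQModel quantizes large language models with minimal accuracy loss.")
] * 256

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.from_pretrained(model_id, quant_config)

# v1.0.5 quantizes only the text-generation layers; vision layers stay at
# original precision, hence the ~50% (not ~75%) size reduction.
model.quantize(calibration)
model.save_quantized("Llama-3.2-11B-Vision-Instruct-gptq-4bit")
```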
- [MODEL] Add Llama 3.2 Vision (mllama)* support by @LRL-ModelCloud in #401
Full Changelog: v1.0.4...v1.0.5
GPTQModel v1.0.4
What's Changed
Liger Kernel support added for a ~50% VRAM reduction during the quantization stage of some models. Added a toggle to disable parallel packing to avoid OOM on larger models. Transformers dependency updated to 4.45.0 for Llama 3.2 support.
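As a sketch only: the flag spellings below (parallel_packing, use_liger_kernel) are taken from the PR titles and are assumptions rather than verified API; check the v1.0.4 docs for the exact names.

```python
# Sketch only: flag names mirror the PR titles and may not match the
# shipped API exactly; consult the v1.0.4 docs for exact spellings.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    parallel_packing=False,  # assumed toggle: serialize packing to lower peak memory
)

model = GPTQModel.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",  # placeholder model id
    quant_config,
    use_liger_kernel=True,  # assumed flag: enable Liger kernels (~50% less quant VRAM)
)
```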
- [FEATURE] add a parallel_packing toggle by @LRL-ModelCloud in #393
- [FEATURE] add liger_kernel support by @LRL-ModelCloud in #394
Full Changelog: v1.0.3...v1.0.4
GPTQModel v1.0.3
What's Changed
- [MODEL] Add minicpm3 by @LDLINGLINGLING in #385
- [FIX] fix minicpm3 support by @LRL-ModelCloud in #387
- [MODEL] Added GRIN-MoE support by @LRL-ModelCloud in #388
New Contributors
- @LDLINGLINGLING made their first contribution in #385
- @mrT23 made their first contribution in #386
Full Changelog: v1.0.2...v1.0.3
GPTQModel v1.0.2
What's Changed
Upgraded the AutoRound package to v0.3.0. Pre-built WHL and PyPI source releases are now available. Install by downloading our pre-built WHL or with pip install gptqmodel --no-build-isolation.
- [CORE] Autoround v0.3 by @LRL-ModelCloud in #368
- [CI] Lots of CI fixups by @CSY-ModelCloud
Full Changelog: v1.0.0...v1.0.2
GPTQModel v1.0.0
What's Changed
40% faster multi-threaded packing, new lm_eval api, and fixed Python 3.9 compatibility.
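The new api presumably wraps an lm-evaluation-harness flow; a rough hand-rolled equivalent calls the harness directly. The checkpoint id and task below are placeholders:

```python
# Rough equivalent of the new lm_eval api, calling the
# lm-evaluation-harness directly; checkpoint id and task are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face backend; loads GPTQ checkpoints via transformers
    model_args="pretrained=ModelCloud/some-gptq-4bit-model",  # hypothetical checkpoint
    tasks=["arc_challenge"],
    num_fewshot=0,
)
print(results["results"])
```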
- Add lm_eval api by @PZS-ModelCloud in #338
- Multi-threaded packing in quantization by @PZS-ModelCloud in #354
- [CI] Add TGI unit test by @PZS-ModelCloud in #348
- [CI] Updates by @CSY-ModelCloud in #347, #352, #353, #355, #357
- Fix python 3.9 compat by @PZS-ModelCloud in #358
Full Changelog: v0.9.11...v1.0.0
GPTQModel v0.9.11
What's Changed
Added LG EXAONE 3.0 model support. New flexible dynamic per-layer/module quantization where each layer/module may use different bits/params. Added proper sharding support to backend.BITBLAS. Auto-heal quantization errors caused by overly small damp values.
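For illustration, a hedged sketch of a per-module override via the new dynamic config; the regex key and the override fields below are examples only, and the exact match syntax may differ:

```python
# Illustrative sketch of the new dynamic per-layer/module config; the
# regex key and override fields are examples, not an exhaustive reference.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # default for all quantized modules
    group_size=128,
    dynamic={
        # modules whose names match the regex get their own bits/params
        r".*\.down_proj.*": {"bits": 8, "group_size": 64},
    },
)
```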
- [CORE] add support for pack and shard to bitblas by @LRL-ModelCloud in #316
- Add dynamic bits by @PZS-ModelCloud in #311, #319, #321, #323, #327
- [MISC] Adjust the validate order of QuantLinear when BACKEND is AUTO by @ZX-ModelCloud in #318
- add save_quantized log model total size by @PZS-ModelCloud in #320
- Auto damp recovery by @CSY-ModelCloud in #326
- [FIX] add missing original_infeatures by @CSY-ModelCloud in #337
- Update Transformers to 4.44.0 by @Qubitium in #336
- [MODEL] add exaone model support by @LRL-ModelCloud in #340
- [CI] Upload wheel to local server by @CSY-ModelCloud in #339
- [MISC] Fix assert by @CSY-ModelCloud in #342
Full Changelog: v0.9.10...v0.9.11
GPTQModel v0.9.10
What's Changed
Ported the vllm/nm gptq_marlin inference kernel with expanded bits (8-bit), group_size (64, 32), and desc_act support for all GPTQ models with format = FORMAT.GPTQ. Auto-calculate auto-round nsamples/seqlen parameters based on the calibration dataset. Fixed save_quantized() when called on pre-quantized models with unsupported backends. HF Transformers dependency updated to ensure Llama 3.1 fixes are correctly applied to both the quant and inference stages.
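For example, the ported kernel can be selected explicitly when loading a pre-quantized checkpoint; the checkpoint id below is a placeholder:

```python
# Sketch: load a pre-quantized FORMAT.GPTQ checkpoint on the ported
# gptq_marlin kernel. The checkpoint id is a placeholder.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.from_quantized(
    "ModelCloud/some-gptq-4bit-model",  # hypothetical pre-quantized checkpoint
    backend=BACKEND.MARLIN,             # 4/8-bit, group_size 32/64, desc_act supported
)
```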
- [CORE] add marlin inference kernel by @ZX-ModelCloud in #310
- [CI] Increase timeout to 40m by @CSY-ModelCloud in #295, #299
- [FIX] save_quantized() by @ZX-ModelCloud in #296
- [FIX] autoround nsample/seqlen to be actual size of calibration_dataset. by @LRL-ModelCloud in #297, @LRL-ModelCloud in #298
- Update HF transformers to 4.43.3 by @Qubitium in #305
- [CI] remove test_marlin_hf_cache_serialization() by @ZX-ModelCloud in #314
Full Changelog: v0.9.9...v0.9.10
GPTQModel v0.9.9
What's Changed
Added Llama-3.1 support, Gemma2 27B quant inference support via vLLM, auto pad_token normalization, fixed auto-round quant compat for vLLM/SGLang.
- [CI] by @CSY-ModelCloud in #238, #236, #237, #241, #242, #243, #246, #247, #250
- [FIX] explicitly call torch.no_grad() by @LRL-ModelCloud in #239
- Bitblas update by @Qubitium in #249
- [FIX] calib avg for calib dataset arg passed as tensors by @Qubitium, @LRL-ModelCloud in #254, #258
- [MODEL] gemma2 27b can load with vLLM now by @LRL-ModelCloud in #257
- [OPTIMIZE] to optimize vllm inference, set an environment variable 'VLLM_ATTENTI… by @LRL-ModelCloud in #260
- [FIX] hard set batch_size to 1 for 4.43.0 transformer due to compat/regression by @LRL-ModelCloud in #279
- FIX vllm llama 3.1 support by @Qubitium in #280
- Use better defaults values for quantization config by @Qubitium in #281
- [REFACTOR] Cleanup backend and model_type usage by @LRL-ModelCloud in #276
- [FIX] allow auto_round lm_head quantization by @LRL-ModelCloud in #282
- [FIX] [MODEL] Llama-3.1-8B-Instruct's eos_token_id is a list by @CSY-ModelCloud in #284
- [FIX] add release_vllm_model, and import destroy_model_parallel in release_vllm_model by @LRL-ModelCloud in #288
- [FIX] autoround quants compat with vllm/sglang by @Qubitium in #287
Full Changelog: v0.9.8...v0.9.9
GPTQModel v0.9.8
What's Changed
- Marlin end-to-end in/out feature padding for max model support
- Run quantized models (FORMAT.GPTQ) directly using the fast vLLM backend!
- Run quantized models (FORMAT.GPTQ) directly using the fast SGLang backend! (see the sketch below)
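A hedged sketch of the vLLM path; the checkpoint id is a placeholder, and the generate() kwargs are assumed to pass through to vLLM's engine:

```python
# Sketch: run a FORMAT.GPTQ checkpoint through the new vLLM backend.
# Checkpoint id is a placeholder; generate() kwargs are assumed to be
# forwarded to vLLM.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.from_quantized(
    "ModelCloud/some-gptq-4bit-model",  # hypothetical quantized checkpoint
    backend=BACKEND.VLLM,
)
output = model.generate(prompts="The capital of France is")
```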
- 🚀 🚀 [CORE] Marlin end-to-end in/out feature padding by @LRL-ModelCloud in #183 #192
- 🚀 🚀 [CORE] Add vLLM Backend for FORMAT.GPTQ by @PZS-ModelCloud in #190
- 🚀 🚀 [CORE] Add SGLang Backend by @PZS-ModelCloud in #191
- 🚀 [CORE] Use Triton v2 to pack gptq/gptqv2 formats by @LRL-ModelCloud in #202
- ✨ [CLEANUP] remove triton warmup by @Qubitium in #200
- 👾 [FIX] 8bit choosing wrong packer by @Qubitium in #199
- ✨ [CI] [CLEANUP] Improve Unit Tests by CSY, PSY, and ZYC
- ✨ [DOC] Consolidate Examples by ZYC in #225
Full Changelog: v0.9.7...v0.9.8