Releases: ModelCloud/GPTQModel

GPTQModel v1.0.6

26 Sep 15:59
25e7313

What's Changed

Patch release to fix loading of quantized Llama 3.2 Vision models.
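
For context, loading a pre-quantized checkpoint is a one-liner. Below is a minimal sketch, assuming GPTQModel's AutoGPTQ-style from_quantized loader; the repo id is a placeholder, not a real checkpoint.

```python
# Minimal sketch: load an already-quantized Llama 3.2 Vision checkpoint.
# The repo id is a placeholder; from_quantized follows the library's
# AutoGPTQ-style loader API.
from gptqmodel import GPTQModel

model = GPTQModel.from_quantized(
    "ModelCloud/Llama-3.2-11B-Vision-Instruct-gptq-4bit",  # placeholder
    device="cuda:0",
)
```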

Full Changelog: v1.0.5...v1.0.6

GPTQModel v1.0.5

26 Sep 10:54
4921d68

What's Changed

Added partial quantization support for the Llama 3.2 Vision model. v1.0.5 allows quantization of the text layers (the layers responsible for text generation) only; vision-layer support will follow shortly. A Llama 3.2 11B Vision Instruct model will quantize to ~50% of its original size in 4-bit mode. Once vision-layer support lands, the size will shrink to the expected ~1/4.
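
A minimal sketch of that text-layer-only flow, assuming the from_pretrained / quantize / save_quantized API and a toy two-sentence calibration set standing in for a real dataset:

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    quant_config,
)

# v1.0.5 quantizes only the text (language) layers; vision layers stay
# at their original precision, hence the ~50% (not ~75%) size reduction.
calibration = [
    "GPTQ quantizes each weight matrix one column block at a time.",
    "Calibration text should resemble the model's deployment domain.",
]
model.quantize(calibration)
model.save_quantized("Llama-3.2-11B-Vision-Instruct-gptq-4bit")
```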

Full Changelog: v1.0.4...v1.0.5

GPTQModel v1.0.4

26 Sep 04:26
cffee9a

What's Changed

Liger Kernel support added, cutting VRAM use during the quantization stage by ~50% for some models. Added a toggle to disable parallel packing to avoid OOM on larger models. Transformers dependency updated to 4.45.0 for Llama 3.2 support.
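
The release note does not name the toggle, so the kwarg below is an assumption rather than a verified signature; treat this as a sketch of where such a switch would live.

```python
from gptqmodel import QuantizeConfig

# ASSUMPTION: "parallel_packing" is a guess at the toggle's name based on
# this release note; check the installed version's QuantizeConfig and
# quantize() signatures before relying on it.
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    parallel_packing=False,  # trade packing speed for lower peak memory
)
```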

Full Changelog: v1.0.3...v1.0.4

GPTQModel v1.0.3

19 Sep 06:36
44b9df7

What's Changed

New Contributors

Full Changelog: v1.0.2...v1.0.3

GPTQModel v1.0.2

17 Aug 01:44
182df2b

What's Changed

Upgraded the AutoRound package to v0.3.0. Pre-built WHL and PyPI source releases are now available. Install by downloading one of our pre-built WHLs or via pip install gptqmodel --no-build-isolation.

Full Changelog: v1.0.0...v1.0.2

GPTQModel v1.0.0

14 Aug 00:29
4a028d5

What's Changed

40% faster multi-threaded packing, a new lm_eval API, and fixed Python 3.9 compatibility.
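
The note doesn't show the new lm_eval API itself; as a stand-in, a quantized checkpoint can be scored with the lm-evaluation-harness directly (lm_eval.simple_evaluate is that harness's documented entry point; the checkpoint path is a placeholder):

```python
import lm_eval

# Score a local GPTQ checkpoint through the harness's HF backend.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./my-gptq-4bit-model",  # placeholder path
    tasks=["arc_easy"],
)
print(results["results"])
```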

Full Changelog: v0.9.11...v1.0.0

GPTQModel v0.9.11

09 Aug 10:33
f2fcdc8

What's Changed

Added LG EXAONE 3.0 model support. New dynamic per-layer/per-module quantization, where each layer/module may use different bits and parameters. Added proper sharding support to backend.BITBLAS. Quantization errors caused by too-small damp values are now auto-healed.
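
A sketch of the dynamic override idea: a mapping from module-name patterns to per-module overrides, passed through QuantizeConfig. The exact key syntax here is an assumption for this version; consult the project docs for the canonical shape.

```python
from gptqmodel import QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,          # default for all layers/modules
    group_size=128,
    dynamic={
        # ASSUMED key syntax: regex over module names -> overrides.
        # Keep MLP projections at 8-bit with finer groups for accuracy.
        r".*\.mlp\..*": {"bits": 8, "group_size": 64},
    },
)
```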

Full Changelog: v0.9.10...v0.9.11

GPTQModel v0.9.10

30 Jul 19:04
233548b

What's Changed

Ported the vLLM/Neural Magic gptq_marlin inference kernel with expanded bits (8-bit), group_size (64, 32), and desc_act support for all GPTQ models with format = FORMAT.GPTQ. Auto-round nsamples/seqlen parameters are now auto-calculated from the calibration dataset. Fixed save_quantized() when called on pre-quantized models with unsupported backends. HF Transformers dependency updated to ensure the Llama 3.1 fixes are correctly applied at both the quantization and inference stages.
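
Selecting the ported kernel at load time, as a sketch: BACKEND is GPTQModel's backend selector enum, and the checkpoint path is a placeholder.

```python
from gptqmodel import GPTQModel, BACKEND

# Load an existing FORMAT.GPTQ checkpoint on the gptq_marlin kernel.
model = GPTQModel.from_quantized(
    "./my-gptq-4bit-model",  # placeholder path
    device="cuda:0",
    backend=BACKEND.MARLIN,
)
```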

Full Changelog: v0.9.9...v0.9.10

GPTQModel v0.9.9

24 Jul 16:42
519fbe3

What's Changed

Added Llama 3.1 support and Gemma 2 27B quantized-inference support via vLLM, plus automatic pad_token normalization; fixed auto-round quantization compatibility with vLLM/SGLang.
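
For the vLLM path, a quantized checkpoint can also be served with vLLM's own Python API (LLM and SamplingParams are vLLM's documented entry points; the path is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="./my-gptq-4bit-model", quantization="gptq")  # placeholder path
outputs = llm.generate(
    ["Explain GPTQ quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```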

Full Changelog: v0.9.8...v0.9.9

GPTQModel v0.9.8

13 Jul 12:55
0d263f3

What's Changed

  1. Marlin end-to-end in/out feature padding for maximum model support
  2. Run quantized models (FORMAT.GPTQ) directly on the fast vLLM backend! (See the sketch after this list.)
  3. Run quantized models (FORMAT.GPTQ) directly on the fast SGLang backend!
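
A minimal sketch of items 2-3, assuming GPTQModel's backend selector also covers these serving backends (the checkpoint path is a placeholder):

```python
from gptqmodel import GPTQModel, BACKEND

# Run a FORMAT.GPTQ checkpoint through the vLLM (or SGLang) backend.
model = GPTQModel.from_quantized(
    "./my-gptq-4bit-model",  # placeholder path
    backend=BACKEND.VLLM,    # or BACKEND.SGLANG
)
```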

Full Changelog: v0.9.7...v0.9.8