int4 quant broken right now? #217

Open
jerryzh168 opened this issue Dec 20, 2024 · 2 comments
Comments

@jerryzh168
Contributor

I tried the following and it seems to break right now:

> python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int4 --groupsize 64
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=6144
Traceback (most recent call last):
  File "/data/users/jerryzh/gpt-fast/quantize.py", line 622, in <module>
    quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/data/users/jerryzh/gpt-fast/quantize.py", line 569, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/home/jerryzh/.conda/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/users/jerryzh/gpt-fast/quantize.py", line 433, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/data/users/jerryzh/gpt-fast/quantize.py", line 363, in prepare_int4_weight_and_scales_and_zeros
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight_int32, inner_k_tiles)
  File "/home/jerryzh/.conda/envs/sglang/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: Expected in.dtype() == at::kByte to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

It's probably because of @yanbing-j's recent refactors, but since we may want to migrate to torchao's quant at some point, I'm not sure it's worth fixing now.
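
If the failure is the dtype change in torch.ops.aten._convert_weight_to_int4pack (newer PyTorch builds appear to expect the weight pre-packed as uint8, two int4 values per byte, rather than int32), a rough, untested sketch of a workaround inside prepare_int4_weight_and_scales_and_zeros could look like the following; the pack_int4_weight helper name is hypothetical:

import torch

def pack_int4_weight(weight_int32: torch.Tensor, inner_k_tiles: int) -> torch.Tensor:
    # Assumption: weight_int32 holds one int4 value (0..15) per int32 element, shape [out, in].
    # Pack adjacent pairs along the `in` dimension into one uint8 (even index in the
    # high nibble, odd index in the low nibble), then call the packing op on the uint8 tensor.
    weight_uint8 = (weight_int32[:, ::2] << 4 | weight_int32[:, 1::2]).to(torch.uint8)
    return torch.ops.aten._convert_weight_to_int4pack(weight_uint8, inner_k_tiles)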

@jerryzh168
Contributor Author

I can quantize with #208 though, so landing that is probably sufficient.

@yanbing-j

@jerryzh168 I have a PR from before, #187, which should fix the int4 API mismatch on the CUDA device.
