ReadMe update and QATv2 Refactoring #339

oguzhanbsolak · 2024-11-11T12:22:40Z

resolves #340

1- Replaced the quantile function with torch.quantile()
2- Improved distributed mode compatibility
3- Fixed an issue that occurred when loading best_ckpt in compile mode
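
For readers unfamiliar with the quantile rule involved: with its default `interpolation='linear'` setting, torch.quantile() interpolates between the two nearest order statistics. The pure-Python sketch below illustrates that rule only. It is not the code from this PR, and the function name `linear_quantile` is hypothetical.

```python
import math


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest sorted values.

    Illustrative stand-in for the default 'linear' interpolation rule;
    not the implementation used in this PR.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional index into the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


if __name__ == "__main__":
    print(linear_quantile([1.0, 2.0, 3.0, 4.0], 0.5))
```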

docs/QATv2.md

ermanok

Looks good.

yousefbilal · 2024-11-27T19:03:06Z

Using the incoming qatv2 branch, the KeyError is fixed, but when I run evaluate_cifar100_mobilenetv2 I get the following error:
Log file for this run: /mnt/d/ai8x/ai8x-training-qatv2/logs/2024.11.27-225820/2024.11.27-225820.log
Traceback (most recent call last):
  File "/mnt/d/ai8x/ai8x-training-qatv2/train.py", line 1577, in <module>
    main()
  File "/mnt/d/ai8x/ai8x-training-qatv2/train.py", line 409, in main
    model = apputils.load_lean_checkpoint(model, args.load_model_path,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/d/ai8x/ai8x-training-qatv2/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 92, in load_lean_checkpoint
    return load_checkpoint(model, chkpt_file, model_device=model_device,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/d/ai8x/ai8x-training-qatv2/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 249, in load_checkpoint
    raise ValueError("The loaded checkpoint (%s) is missing %d state keys" %
ValueError: The loaded checkpoint (./logs/2024.11.27-223950/qat_best.pth.tar) is missing 136 state keys

oguzhanbsolak · 2024-11-29T08:24:11Z

Hi, thanks for getting in touch. Could you please share the checkpoint file (2024.11.27-223950/qat_best.pth.tar) with us? Also, please indicate which script you are using: "evaluate_cifar100_mobilenet_v2_0.5.sh" or "evaluate_cifar100_mobilenet_v2_0.75.sh".

yousefbilal · 2024-11-29T09:55:13Z

I ran evaluate_cifar100_mobilenet_v2_0.75.sh, but I believe the same issue would occur with evaluate_cifar100_mobilenet_v2_0.5.sh.
qat_best.pth.tar.zip

oguzhanbsolak · 2024-11-29T10:14:42Z

Please update your evaluation script to include the QAT policy used during training by passing --qat-policy policy_used_in_training.yaml. Once QAT starts, the batch normalization parameters are fused, so QAT checkpoints do not contain batch normalization parameters.

If the QAT policy is not specified in the evaluation script, the default policy is applied, in which QAT begins at epoch 30. Since your checkpoint is from epoch 19, omitting the correct policy causes a mismatch: the model still contains batch normalization layers, but the checkpoint lacks the corresponding parameters. That is why the loader reports missing state keys.
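
The mismatch described above can be sketched with plain Python sets. This is a simplified illustration, not the ai8x or distiller loading code: the layer names, the key names, the `epoch >= start` comparison, and the training start epoch of 10 are all assumptions made for the example. Only the default start epoch of 30 and the checkpoint epoch of 19 come from this thread.

```python
DEFAULT_QAT_START_EPOCH = 30  # default policy, as stated in this thread
CHECKPOINT_EPOCH = 19         # epoch of the checkpoint in question


def bn_is_fused(epoch, qat_start_epoch):
    """Assumed rule: batch norm counts as fused once QAT has started."""
    return epoch >= qat_start_epoch


def expected_keys(layers, fused):
    """Build hypothetical state keys for a model with or without BN layers."""
    keys = set()
    for name in layers:
        keys.add(f"{name}.conv.weight")
        if not fused:
            keys.add(f"{name}.bn.running_mean")
            keys.add(f"{name}.bn.running_var")
    return keys


def missing_keys(model_keys, ckpt_keys):
    """Keys the model expects but the checkpoint does not provide."""
    return sorted(model_keys - ckpt_keys)


layers = ["layer1", "layer2"]  # hypothetical layer names
training_start = 10            # hypothetical start epoch of the training policy

# The checkpoint was written after QAT started, so it has no BN keys.
ckpt = expected_keys(layers, bn_is_fused(CHECKPOINT_EPOCH, training_start))

# Evaluating with the default policy: the model still expects BN keys.
model_default = expected_keys(
    layers, bn_is_fused(CHECKPOINT_EPOCH, DEFAULT_QAT_START_EPOCH))
# Evaluating with the training policy: the key sets match.
model_matching = expected_keys(
    layers, bn_is_fused(CHECKPOINT_EPOCH, training_start))

if __name__ == "__main__":
    print(missing_keys(model_default, ckpt))
    print(missing_keys(model_matching, ckpt))
```

With the default policy the comparison lists the batch normalization keys as missing. With the training policy it lists nothing.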

P.S.: Another detail that is easy to miss is the -8 parameter in the evaluation script. The -8 parameter is intended for evaluating a quantized checkpoint. Since your current checkpoint comes from QAT, you must remove this parameter to evaluate the model properly. After completing the synthesizer steps, you can add the -8 parameter back to evaluate the fully quantized checkpoint.
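
As a rough sketch of the two changes suggested above for a QAT checkpoint (add --qat-policy, drop -8), the helper below rewrites an argument list. The function name `args_for_qat_checkpoint` and the sample argument list are hypothetical. Only the --qat-policy and -8 options come from this thread, and the real evaluation scripts contain further options not shown here.

```python
def args_for_qat_checkpoint(args, qat_policy):
    """Return a copy of args with -8 removed and --qat-policy set.

    Any existing --qat-policy option and its value are replaced.
    """
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the value of a previous --qat-policy; drop it.
            skip_next = False
            continue
        if arg == "-8":
            # -8 is meant for quantized checkpoints, not QAT checkpoints.
            continue
        if arg == "--qat-policy":
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned + ["--qat-policy", qat_policy]


if __name__ == "__main__":
    original = ["train.py", "-8"]  # hypothetical, heavily shortened
    print(args_for_qat_checkpoint(original, "policy_used_in_training.yaml"))
```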

oguzhanbsolak requested review from alicangok, ermanok, MaximGorkem, seldauyanik-maxim and asyatrhl November 11, 2024 12:22

ReadMe update and QATv2 Refactoring

1cae6b0

oguzhanbsolak force-pushed the qatv2 branch from 06c4d8f to 1cae6b0 Compare November 21, 2024 11:37

oguzhanbsolak added 3 commits November 21, 2024 17:43

Distributed mode improvements, quantile function replacement

2ee12fb

Revert local_rank change for pre_qat

92568a2

Add optimize_ddp flag to test()

a9507b0

seldauyanik approved these changes Nov 26, 2024

ermanok suggested changes Nov 26, 2024

QATv2.md update

27de763

ermanok approved these changes Nov 27, 2024

asyatrhl approved these changes Dec 6, 2024

MaximGorkem merged commit 0fa1fb1 into analogdevicesinc:develop Dec 6, 2024
2 checks passed
