
KeyError: '_orig_mod.conv1.output_shift' #340

Closed
yousefbilal opened this issue Nov 18, 2024 · 5 comments · Fixed by #339

@yousefbilal

I followed the installation and setup steps from the README. However, when I run a training script such as scripts/train_mnist.sh, I get the following error:
Traceback (most recent call last):
File "/home/wsl/ai8x-training/train.py", line 1564, in <module>
main()
File "/home/wsl/ai8x-training/train.py", line 742, in main
test(test_loader, model, criterion, [pylogger], args=args, mode="best",
File "/home/wsl/ai8x-training/train.py", line 1097, in test
model = apputils.load_lean_checkpoint(model, best_ckpt_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 92, in load_lean_checkpoint
return load_checkpoint(model, chkpt_file, model_device=model_device,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 212, in load_checkpoint
normalize_dataparallel_keys = _load_compression_scheduler()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 133, in _load_compression_scheduler
compression_scheduler.load_state_dict(checkpoint['compression_sched'], normalize_keys)
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/scheduler.py", line 213, in load_state_dict
masker.mask = loaded_masks[name]
~~~~~~~~~~~~^^^^^^
KeyError: '_orig_mod.conv1.output_shift'

This happens at the end of training for every training script that I tried. Please help me resolve it.
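For context (my own reading, not confirmed by the thread): the `_orig_mod.` prefix in the KeyError is the prefix `torch.compile` adds to state-dict keys when it wraps a model, so a compression-scheduler mask dictionary saved from a compiled model no longer matches the plain module names the rebuilt model expects. A minimal sketch of normalizing such keys, purely illustrative and not the actual fix merged in #339:

```python
# Sketch (assumption): checkpoints saved from a torch.compile-wrapped
# model store keys like "_orig_mod.conv1.output_shift", while the
# uncompiled model expects "conv1.output_shift". Strip the prefix
# before matching keys. Illustrative only, not the fix in #339.

PREFIX = "_orig_mod."

def strip_compile_prefix(state_dict, prefix=PREFIX):
    """Return a copy of state_dict with the torch.compile prefix removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

loaded_masks = {
    "_orig_mod.conv1.output_shift": None,
    "_orig_mod.conv1.weight": "mask-tensor",
}
normalized = strip_compile_prefix(loaded_masks)
print(sorted(normalized))  # ['conv1.output_shift', 'conv1.weight']
```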

@FlukeMulti

I am facing the exact same error on WSL and native Windows when training with the sample scripts.

@matiasV-TYN

matiasV-TYN commented Nov 18, 2024

I was having a similar issue. The fix was to run from the qatv2 branch instead of the default branch; see my issue #335.
Basically, you need PR #331 from this repo and PR #354 from the synthesis repo (analogdevicesinc/ai8x-synthesis#354).

@yousefbilal (Author)

Can you provide some more details? I believe qatv2 has already been merged into the default branch.

@matiasV-TYN

Hmm, maybe it has. But when I had this same problem about three weeks ago, I had to pull #331 for the training repo and #354 for the synthesis repo myself. After that, the training scripts ran with no problem.

@oguzhanbsolak
Contributor

oguzhanbsolak commented Nov 21, 2024


Hi,

Thank you for letting us know about this issue. It is a known issue and will be fixed in an upcoming PR. As a temporary workaround, you can pass the "--compiler-mode none" argument in your training script. Also, if you have multiple GPUs, please add the "--gpus 0" argument and do not use distributed training. Note that this step is performed at the end of training to evaluate your best checkpoint, so you can safely ignore the error and continue with the other steps.
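To illustrate why "--compiler-mode none" sidesteps the KeyError, here is a hypothetical sketch of how such a flag could gate model compilation inside a training script. Only the flag name comes from the workaround above; the parser and gating logic are illustrative assumptions, not the actual ai8x-training implementation:

```python
# Sketch (assumption): gating torch.compile on a "--compiler-mode" flag.
# With "none", the model is never wrapped, so checkpoint state-dict keys
# keep their plain names (e.g. "conv1.output_shift" rather than
# "_orig_mod.conv1.output_shift") and loading the compression-scheduler
# masks does not raise a KeyError.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--compiler-mode",
    default="default",
    choices=["none", "default", "reduce-overhead", "max-autotune"],
)
args = parser.parse_args(["--compiler-mode", "none"])

# Hypothetical gate: skip torch.compile entirely when mode is "none".
use_compile = args.compiler_mode != "none"
print(use_compile)  # False
```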
