
KeyError: '_orig_mod.conv1.output_shift' #340

Closed
yousefbilal opened this issue Nov 18, 2024 · 5 comments · Fixed by #339

@yousefbilal

I followed the installation and setup steps from the README. However, when I run a training script such as scripts/train_mnist.sh, I get the following error:
Traceback (most recent call last):
File "/home/wsl/ai8x-training/train.py", line 1564, in <module>
main()
File "/home/wsl/ai8x-training/train.py", line 742, in main
test(test_loader, model, criterion, [pylogger], args=args, mode="best",
File "/home/wsl/ai8x-training/train.py", line 1097, in test
model = apputils.load_lean_checkpoint(model, best_ckpt_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 92, in load_lean_checkpoint
return load_checkpoint(model, chkpt_file, model_device=model_device,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 212, in load_checkpoint
normalize_dataparallel_keys = _load_compression_scheduler()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/apputils/checkpoint.py", line 133, in _load_compression_scheduler
compression_scheduler.load_state_dict(checkpoint['compression_sched'], normalize_keys)
File "/home/wsl/ai8x-training/distiller/build/editable.distiller-0.4.0rc0-py3-none-any/distiller/scheduler.py", line 213, in load_state_dict
masker.mask = loaded_masks[name]
~~~~~~~~~~~~^^^^^^
KeyError: '_orig_mod.conv1.output_shift'

This happens at the end of training for every training script that I tried. Please help me resolve it.
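For context (my own reading, not confirmed by the thread): the `_orig_mod.` prefix in the KeyError is the prefix `torch.compile` adds to state-dict keys when it wraps a model, so a compression-scheduler mask dictionary saved from a compiled model no longer matches the plain module names the rebuilt model expects. A minimal sketch of normalizing such keys, purely illustrative and not the actual fix merged in #339:

```python
# Sketch (assumption): checkpoints saved from a torch.compile-wrapped
# model store keys like "_orig_mod.conv1.output_shift", while the
# uncompiled model expects "conv1.output_shift". Strip the prefix
# before matching keys. Illustrative only, not the fix in #339.

PREFIX = "_orig_mod."

def strip_compile_prefix(state_dict, prefix=PREFIX):
    """Return a copy of state_dict with the torch.compile prefix removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

loaded_masks = {
    "_orig_mod.conv1.output_shift": None,
    "_orig_mod.conv1.weight": "mask-tensor",
}
normalized = strip_compile_prefix(loaded_masks)
print(sorted(normalized))  # ['conv1.output_shift', 'conv1.weight']
```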

@FlukeMulti

I am facing the exact same error on WSL and native Windows when training with the sample scripts.

@matiasV-TYN

matiasV-TYN commented Nov 18, 2024

I was having a similar issue. The fix was to run from the qatv2 branch instead of the default branch; see my issue #335.
Basically, you need PR #331 from this repo and PR #354 from the synthesis repo (analogdevicesinc/ai8x-synthesis#354).

@yousefbilal (Author)

Can you provide some more details? I believe qatv2 has already been merged into the default branch.

@matiasV-TYN

Hmm, maybe it has. But when I had this same problem about three weeks ago, I had to pull #331 for the training repo and #354 for the synthesis repo myself. After that, the training scripts ran with no problem.

@oguzhanbsolak
Contributor

oguzhanbsolak commented Nov 21, 2024


Hi,

Thank you for letting us know about this issue. It is a known issue and will be fixed in an upcoming PR. As a temporary workaround, you can pass the "--compiler-mode none" argument in your training script. Also, if you have multiple GPUs, please add the "--gpus 0" argument and do not use distributed training. Note that this step is performed at the end of training to evaluate your best checkpoint, so you can safely ignore the error and continue with the other steps.
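To illustrate why "--compiler-mode none" sidesteps the KeyError, here is a hypothetical sketch of how such a flag could gate model compilation inside a training script. Only the flag name comes from the workaround above; the parser and gating logic are illustrative assumptions, not the actual ai8x-training implementation:

```python
# Sketch (assumption): gating torch.compile on a "--compiler-mode" flag.
# With "none", the model is never wrapped, so checkpoint state-dict keys
# keep their plain names (e.g. "conv1.output_shift" rather than
# "_orig_mod.conv1.output_shift") and loading the compression-scheduler
# masks does not raise a KeyError.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--compiler-mode",
    default="default",
    choices=["none", "default", "reduce-overhead", "max-autotune"],
)
args = parser.parse_args(["--compiler-mode", "none"])

# Hypothetical gate: skip torch.compile entirely when mode is "none".
use_compile = args.compiler_mode != "none"
print(use_compile)  # False
```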
