Hi!
I have a problem restoring a checkpoint.
Training stopped, so I tried to restore from the checkpoint but got an error.
Help! T^T
restore is True
Found folder! restoring
Traceback (most recent call last):
File "train.py", line 122, in <module>
learning_rate_scheduler=scheduler)
File "/home/ailab/HGL-pytorch/utils/pytorch_misc.py", line 226, in restore_checkpoint
training_state = torch.load(training_state_path, map_location=device_mapping(-1))
File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
return _load(f, map_location, pickle_module)
File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 549, in _load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 4859355 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
what(): owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 ASSERT FAILED at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f3920592cf5 in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: THStorage_free + 0xca (0x7f38d72a68ea in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: <unknown function> + 0x12c11d (0x7f39208d011d in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xf0 (0x7f39266a8830 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
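Since the error says the file might be corrupted, one way to narrow this down is to check which saved files can still be read and fall back to the newest one that loads. The following is only a minimal sketch of that idea: `find_loadable_checkpoint`, `load_fn`, and `load_pickle` are hypothetical names and not part of HGL-pytorch, and `pickle` stands in for the real loader so the snippet runs without PyTorch. With PyTorch, `load_fn` would presumably be something like `lambda p: torch.load(p, map_location="cpu")`.

```python
import pickle


def load_pickle(path):
    # Stand-in loader for this sketch; swap in a torch.load wrapper in practice.
    with open(path, "rb") as f:
        return pickle.load(f)


def find_loadable_checkpoint(paths, load_fn):
    """Try each path in order; return (path, state) for the first that loads.

    Returns (None, None) if every candidate fails to load.
    """
    for path in paths:
        try:
            return path, load_fn(path)
        except (RuntimeError, EOFError, pickle.UnpicklingError, OSError) as err:
            # A truncated file typically fails here, e.g. with the
            # "unexpected EOF" RuntimeError shown in the traceback above.
            print(f"skipping {path}: {err}")
    return None, None
```

Passing the candidate files ordered from newest to oldest would return the most recent state that is still intact. If none of them load, the file was probably cut off while it was being written, and restoring from it is unlikely to work.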