i have problem about restore checkpoint! #2

Open
jaeyun95 opened this issue Jan 22, 2020 · 2 comments

Comments

@jaeyun95

Hi!
I have a problem restoring a checkpoint.
Training stopped, so I tried to restore from the checkpoint but got an error.
Help! T^T

restore is True
Found folder! restoring
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    learning_rate_scheduler=scheduler)
  File "/home/ailab/HGL-pytorch/utils/pytorch_misc.py", line 226, in restore_checkpoint
    training_state = torch.load(training_state_path, map_location=device_mapping(-1))
  File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 368, in load
    return _load(f, map_location, pickle_module)
  File "/home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 4859355 more bytes. The file might be corrupted.
terminate called after throwing an instance of 'c10::Error'
  what():  owning_ptr == NullType::singleton() || owning_ptr->refcount_.load() > 0 ASSERT FAILED at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350, please report a bug to PyTorch. intrusive_ptr: Can only intrusive_ptr::reclaim() owning pointers that were created using intrusive_ptr::release(). (reclaim at /opt/conda/conda-bld/pytorch_1549628766161/work/c10/util/intrusive_ptr.h:350)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f3920592cf5 in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: THStorage_free + 0xca (0x7f38d72a68ea in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #2: <unknown function> + 0x12c11d (0x7f39208d011d in /home/ailab/anaconda3/envs/r2c/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #17: __libc_start_main + 0xf0 (0x7f39266a8830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@yuweijiang
Owner

It seems that you loaded an incomplete file. Could you check the path where your checkpoint is saved and make sure the checkpoint was written completely?
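
One way to run that check, and to avoid half-written files in the future, is sketched below. This is only an illustration and is not taken from the HGL-pytorch code: `atomic_save` and `is_loadable` are hypothetical helper names, and the sketch uses `pickle` so that it runs without PyTorch. The idea is to write the checkpoint to a temporary file in the same directory and rename it into place only after the write has finished, so an interrupted run is less likely to leave a truncated checkpoint at the final path.

```python
import os
import pickle
import tempfile


def atomic_save(obj, path, save_fn=pickle.dump):
    """Serialize obj to a temp file next to path, then rename it into place.

    save_fn(obj, file_object) does the actual writing. Hypothetical helper,
    not part of HGL-pytorch or PyTorch.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            save_fn(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the partial temp file; the old checkpoint stays untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def is_loadable(path, load_fn=pickle.load):
    """Return True if the file at path deserializes without raising."""
    try:
        with open(path, "rb") as f:
            load_fn(f)
        return True
    except Exception:
        return False
```

With PyTorch, the same helpers could presumably be called as `atomic_save(state, path, save_fn=torch.save)` and `is_loadable(path, load_fn=lambda f: torch.load(f, map_location="cpu"))`, since `torch.save` and `torch.load` both accept an open file object. If `is_loadable` returns False for the newest checkpoint, restoring from an older checkpoint (if one exists) or restarting training is likely the only option, because the missing bytes cannot be recovered.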

@tuyunbin

Hi, how much GPU memory did you need to run this code successfully? @jaeyun95
