
RuntimeError: CUDA error: an illegal memory access was encountered #6

wsonia opened this issue Dec 8, 2021 · 4 comments
wsonia commented Dec 8, 2021

I deployed the same environment and used the public cardiac data to run the code, but got this error while training:
Validation sanity check: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 137, in <module>
trainer.fit(net, datamodule=data_module)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
self._call_and_handle_interrupt(
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
self._dispatch()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
self.training_type_plugin.start_training(self)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
return self._run_train()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
self._run_sanity_check(self.lightning_module)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in _run_sanity_check
self._evaluation_loop.run()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 109, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
return self.model(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
output = self.module.validation_step(*inputs, **kwargs)
File "/3D-UCaps-main/module/ucaps.py", line 265, in validation_step
val_outputs = sliding_window_inference(
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/monai/inferers/utils.py", line 130, in sliding_window_inference
seg_prob = predictor(window_data, *args, **kwargs).to(device) # batched patch segmentation
File "/3D-UCaps-main/module/ucaps.py", line 171, in forward
x = self.feature_extractor(x)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 572, in forward
return F.conv3d(input, self.weight, self.bias, self.stride,
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
./train_ucaps_cardiac.sh: line 25: 171684 Aborted (core dumped) python train.py --log_dir ./3D-UCaps-main/logs_heart --gpus 1 --accelerator ddp --check_val_every_n_epoch 5 --max_epochs 100 --dataset task02_heart --model_name ucaps --root_dir ./3D-UCaps-main/Task02_Heart --fold 0 --cache_rate 1.0 --train_patch_size 128 128 128 --num_workers 64 --batch_size 1 --share_weight 0 --num_samples 1 --in_channels 1 --out_channels 2 --val_patch_size(UCaps

hoangtan96dl (Contributor) commented Dec 11, 2021

Hello @wentj897, can I know which device you are using? From my experience, this can be caused by an out-of-memory error; you can reduce the memory requirement by:

  • Reducing batch_size, num_samples, and train_patch_size for the training step
  • Reducing sw_batch_size and val_patch_size for the validation step

You are hitting the problem at the validation step, so I think you should reduce val_patch_size to a smaller size like 64x64x64 that fits on your device (see the sketch below). The training step will use even more memory, so you should reduce those settings as well.
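For reference, here is a minimal Python sketch (not the repository's exact code; the validate_volume, model, and image names are placeholders) of how a reduced validation patch size and sliding-window batch size map onto MONAI's sliding_window_inference, which the traceback above shows is used for the validation step:

import torch
from monai.inferers import sliding_window_inference

def validate_volume(model, image):
    # image: a (batch, channel, D, H, W) volume on the same device as the model
    with torch.no_grad():
        return sliding_window_inference(
            inputs=image,
            roi_size=(64, 64, 64),  # a reduced val_patch_size, down from 128x128x128
            sw_batch_size=1,        # how many windows go through the model per forward pass
            predictor=model,
        )

Only sw_batch_size windows of roi_size voxels are pushed through the network per forward pass, so shrinking these two values directly lowers peak GPU memory during validation.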

kingjames1155 commented

Thank you very much for your open-source code. I also encountered this problem in the training stage; how did you solve it? My GPU is a 3090 with CUDA 11.3. I've tried reducing batch_size, num_samples, and train_patch_size, but it did not work.

kingjames1155 commented

path/to/luna/
imgs
segs

Are the files extracted from subset0-subset9 stored in the imgs folder?
Are the extracted seg-lungs-LUNA16 files stored in the segs folder?
Do they need any other preprocessing?

hoangtan96dl (Contributor) commented

Hello @kingjames1155, sorry for my late reply.
For your first question: if you have the same problem as @wentj897, you are actually getting the error at the validation step, not the training step. Hence, try reducing val_patch_size and sw_batch_size first to see if that solves the problem.
For your second question: yes, you just need to extract the LUNA dataset as it is. I also point out some erroneous files in the README that you need to remove.
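As a quick sanity check on the layout above, here is a small hedged sketch that just counts the extracted files; the imgs/segs folder names follow the layout in the question, and the .mhd extension is an assumption about how the LUNA16 archives extract, not something taken from the repository's loader:

from pathlib import Path

root = Path("path/to/luna")  # dataset root, as in the layout above
imgs = sorted((root / "imgs").glob("*.mhd"))  # volumes extracted from subset0-subset9
segs = sorted((root / "segs").glob("*.mhd"))  # lung masks from seg-lungs-LUNA16 (assumed extension)
print(f"found {len(imgs)} images and {len(segs)} segmentations")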
