RuntimeError: CUDA error: an illegal memory access was encountered #6
Hello @wentj897, may I ask which device you are using? In my experience, this can be caused by an out-of-memory error. You can reduce the memory requirement by:
You are having a problem with the validation step, so I think you should reduce
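To put a rough number on the memory argument above, here is a back-of-the-envelope sketch (not taken from the repository) of how the input size of one sliding-window forward pass scales with the patch size. The helper name `voxels_per_forward` and the `sw_batch_size` and `channels` defaults are illustrative assumptions; actual GPU memory also depends on the network's activations, so the figures are only meaningful relative to each other.

```python
from math import prod


def voxels_per_forward(patch_size, sw_batch_size=1, channels=1):
    """Count the input voxels in one batched forward pass (hypothetical helper)."""
    return sw_batch_size * channels * prod(patch_size)


# Compare a 128^3 patch against a 64^3 patch.
full = voxels_per_forward((128, 128, 128))
half = voxels_per_forward((64, 64, 64))
print(full, half, full // half)
```

Halving each side of a cubic patch cuts the per-pass input by a factor of eight, which is why the validation patch size is usually the first thing to try when the failure happens in the validation step.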
Thank you very much for your open-source code. I also encountered this problem in the training stage. How did you solve it? My GPU is a 3090 with CUDA 11.3. I have tried reducing batch_size, num_samples, and train_patch_size, but it did not work.
Regarding path/to/luna/: are the files extracted from subset0-subset9 stored in the imgs folder?
Hello @kingjames1155, sorry for my late reply.
I deployed the same environment and used the public cardiac data to run the code, but I got this error during training:
Validation sanity check: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 137, in <module>
trainer.fit(net, datamodule=data_module)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
self._call_and_handle_interrupt(
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
self._dispatch()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
self.training_type_plugin.start_training(self)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
return self._run_train()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
self._run_sanity_check(self.lightning_module)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in _run_sanity_check
self._evaluation_loop.run()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 109, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
return self.model(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
output = self.module.validation_step(*inputs, **kwargs)
File "/3D-UCaps-main/module/ucaps.py", line 265, in validation_step
val_outputs = sliding_window_inference(
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/monai/inferers/utils.py", line 130, in sliding_window_inference
seg_prob = predictor(window_data, *args, **kwargs).to(device) # batched patch segmentation
File "/3D-UCaps-main/module/ucaps.py", line 171, in forward
x = self.feature_extractor(x)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 572, in forward
return F.conv3d(input, self.weight, self.bias, self.stride,
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
./train_ucaps_cardiac.sh: line 25: 171684 Aborted (core dumped) python train.py --log_dir ./3D-UCaps-main/logs_heart --gpus 1 --accelerator ddp --check_val_every_n_epoch 5 --max_epochs 100 --dataset task02_heart --model_name ucaps --root_dir ./3D-UCaps-main/Task02_Heart --fold 0 --cache_rate 1.0 --train_patch_size 128 128 128 --num_workers 64 --batch_size 1 --share_weight 0 --num_samples 1 --in_channels 1 --out_channels 2 --val_patch_size(UCaps
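CUDA kernels are launched asynchronously, so the Python frame reported for an illegal memory access (here the `F.conv3d` call) is not necessarily where the fault actually happened. A common first debugging step is to rerun with the `CUDA_LAUNCH_BLOCKING=1` environment variable so that errors surface at the offending call. The sketch below only builds such an environment; the helper name `debug_env` is hypothetical, and launching the training script from Python is an assumption rather than how the repository is meant to be run.

```python
import os


def debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Must be set before the CUDA context is created, i.e. before the run starts.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Demonstrate on a small stand-in environment.
env = debug_env({"PATH": "/usr/bin"})
print(env["CUDA_LAUNCH_BLOCKING"])

# To use it, pass the result as the env argument of subprocess.run together
# with the same train.py arguments as in the failing command above.
```

Running this way is slower, so it is meant only for locating the failing operation, not for regular training.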