Hello, I'm having some problems. RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #17

Open
zhangkuncsdn opened this issue Feb 22, 2021 · 11 comments

Comments

@zhangkuncsdn

Training model (TrainedModels/pantheon/2021_02_22_15_28_21_generation_train_depth_3_lr_scale_0.1_act_lrelu_0.05)
Training model with the following parameters:
number of stages: 6
number of concurrently trained stages: 3
learning rate scaling: 0.1
non-linearity: lrelu
Training on image pyramid: [torch.Size([1, 3, 26, 42]), torch.Size([1, 3, 31, 51]), torch.Size([1, 3, 40, 66]), torch.Size([1, 3, 57, 94]), torch.Size([1, 3, 106, 175]), torch.Size([1, 3, 152, 250])]

stage [0/5]:: 0%| | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "main_train.py", line 118, in <module>
train(opt)
File "G:\ConSinGAN\ConSinGAN\training_generation.py", line 48, in train
fixed_noise, noise_amp, generator, d_curr = train_single_scale(d_curr, generator, reals, fixed_noise, noise_amp, opt, scale_num, writer)
File "G:\ConSinGAN\ConSinGAN\training_generation.py", line 156, in train_single_scale
gradient_penalty = functions.calc_gradient_penalty(netD, real, fake, opt.lambda_grad, opt.device)
File "G:\ConSinGAN\ConSinGAN\functions.py", line 122, in calc_gradient_penalty
create_graph=True, retain_graph=True, only_inputs=True)[0]
File "D:\Anaconda3\envs\ConSinGAN\lib\site-packages\torch\autograd\__init__.py", line 149, in grad
inputs, allow_unused)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

@tohinz
Owner

tohinz commented Feb 23, 2021

Hi, that looks more like a problem with your PyTorch installation. Are you sure you have the correct CUDA and cuDNN versions installed for your graphics card and PyTorch version?
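
One way to answer that question is to print what PyTorch itself reports. The following is a minimal sketch, not part of the ConSinGAN code; the helper name `describe_torch_env` is made up for illustration, and it only uses standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.backends.cudnn.version()`, `torch.cuda.get_device_name()`).

```python
import importlib


def describe_torch_env():
    """Return a dict describing the installed PyTorch / CUDA / cuDNN combination."""
    info = {
        "torch": None,            # PyTorch version string
        "cuda_build": None,       # CUDA version PyTorch was compiled against
        "cudnn": None,            # cuDNN version as reported by PyTorch
        "cuda_available": False,  # whether PyTorch can actually use a GPU
        "device_name": None,      # name of GPU 0, if any
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = bool(torch.cuda.is_available())
    if info["cuda_available"]:
        info["cudnn"] = torch.backends.cudnn.version()
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

If `cuda_build` does not match a CUDA version supported by the installed GPU driver, or `cuda_available` is False on a machine with a GPU, the installation is the likely culprit rather than the training code.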

@zhangkuncsdn
Author

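Hi, that looks more like a problem with your PyTorch installation. Are you sure you have the correct CUDA and cuDNN versions installed for your graphics card and PyTorch version?

Hi, does the current ConSinGAN environment support PyTorch 1.7?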

@tohinz
Owner

tohinz commented Feb 24, 2021

I haven't tested it with PyTorch 1.7, but in general it should work (at the very least I would expect a different error message from the one above). The error is thrown in the torch.autograd.grad() function, which is why I believe it's a problem with your environment and not with the code itself.
I would suggest running the code on CPU (use the flag --not_cuda) to see whether it works there or whether you get a more informative error message. I haven't tested it on CPU myself, so you might have to add .to(torch.device('cpu')) at some points if PyTorch raises errors about a GPU/CPU mismatch.
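
As a rough illustration of how such a switch usually maps to a device, here is a minimal sketch. It assumes `--not_cuda` is a plain boolean flag; the helper `pick_device` and the `"cuda:0"` default are assumptions for illustration and may differ from the repository's actual argument handling.

```python
import argparse


def pick_device(argv=None):
    """Parse a --not_cuda switch and return the device string to use."""
    parser = argparse.ArgumentParser()
    # Assumed definition: a boolean switch that disables CUDA when present.
    parser.add_argument("--not_cuda", action="store_true", help="run on CPU")
    opt = parser.parse_args(argv)
    return "cpu" if opt.not_cuda else "cuda:0"


device = pick_device(["--not_cuda"])
print(device)

# With PyTorch installed, every tensor and module then has to live on that
# one device, which is where explicit moves come in, for example:
#   noise = noise.to(torch.device(device))
#   netD = netD.to(torch.device(device))
```

A device-mismatch error on CPU typically points at a tensor that was created without going through this single device setting.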

@zhangkuncsdn
Author

Thank you very much. With the --not_cuda flag it runs. I have another question: if I want to train on a single-channel grayscale image, how should I modify the network?

@tohinz
Owner

tohinz commented Feb 24, 2021

Just set --nc_im 1 and represent your image as shape (H x W x 1), i.e. 1 channel instead of 3 for RGB

@zhangkuncsdn
Author

I set --nc_im 1, but I'm running into the following problem:
Training model (TrainedModels/07/2021_02_24_22_00_10_generation_train_depth_3_lr_scale_0.1_act_lrelu_0.05)
Training model with the following parameters:
number of stages: 6
number of concurrently trained stages: 3
learning rate scaling: 0.1
non-linearity: lrelu
Traceback (most recent call last):
File "main_train.py", line 118, in <module>
train(opt)
File "G:\ConSinGAN\ConSinGAN\training_generation.py", line 23, in train
real = functions.adjust_scales2image(real, opt)
File "G:\ConSinGAN\ConSinGAN\functions.py", line 185, in adjust_scales2image
real = imresize(real_, opt.scale1, opt)
File "G:\ConSinGAN\ConSinGAN\imresize.py", line 52, in imresize
im = np2torch(im,opt)
File "G:\ConSinGAN\ConSinGAN\imresize.py", line 26, in np2torch
x = color.rgb2gray(x)
File "D:\Anaconda3\envs\ConSinGAN\lib\site-packages\skimage\color\colorconv.py", line 799, in rgb2gray
rgb = _prepare_colorarray(rgb[..., :3])
File "D:\Anaconda3\envs\ConSinGAN\lib\site-packages\skimage\color\colorconv.py", line 152, in _prepare_colorarray
raise ValueError(msg)
ValueError: the input array must be have a shape == (.., ..,[ ..,] 3)), got (164, 250, 1)

@tohinz
Owner

tohinz commented Feb 24, 2021

You will have to change the code slightly to adapt it to this.
Another easy workaround is to convert your grayscale image to a "color image" with 3 channels, e.g. with OpenCV: cv2.cvtColor(gray_img, cv2.COLOR_GRAY2RGB)
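
The same 3-channel workaround can be sketched with NumPy alone, which avoids the OpenCV dependency. This is an illustrative helper, not code from the repository; the name `to_three_channels` is hypothetical, and the example shape is taken from the error message above. Copying the gray channel three times should give the same result as a gray-to-RGB conversion.

```python
import numpy as np


def to_three_channels(img):
    """Return an (H, W, 3) array from a grayscale image of shape (H, W) or (H, W, 1)."""
    img = np.asarray(img)
    if img.ndim == 2:
        img = img[..., np.newaxis]  # (H, W) -> (H, W, 1)
    if img.ndim != 3:
        raise ValueError(f"expected a 2-D or 3-D image, got shape {img.shape}")
    if img.shape[-1] == 3:
        return img  # already has three channels
    if img.shape[-1] != 1:
        raise ValueError(f"expected 1 or 3 channels, got {img.shape[-1]}")
    return np.repeat(img, 3, axis=-1)  # copy the single channel three times


gray = np.zeros((164, 250, 1), dtype=np.uint8)
rgb = to_three_channels(gray)
print(rgb.shape)
```

With an image converted this way, the channel setting can presumably be left at 3 instead of passing --nc_im 1, so the rgb2gray code path that raised the ValueError is not taken.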

@zhangkuncsdn
Author

There are some problems when I change the code. Can you give me some advice?

@tohinz
Owner

tohinz commented Mar 1, 2021

What are the problems?

@FluppyBird

I had the same problem 3 days ago. Installing with conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch unexpectedly got it running. This is the same version of torch that SinGAN uses, so maybe you can try it. : )

@LiJuanapple

I have run into the same problem. How did you resolve it in the end? Thank you very much for sharing your guidance.
