Bad training performance with custom data #2

Open
nsl2014fm opened this issue Sep 27, 2022 · 6 comments

Comments

@nsl2014fm

Hi, thanks for your great work.
However, I cannot reproduce the result with the test dataset [dance] you provided in README.md, because transforms.json is missing some parameters:
Traceback (most recent call last):
  File "train_nerf.py", line 107, in <module>
    train_dataset = NeRFDataset(opt.path, type='train', mode=opt.format, bound=opt.bound)
  File "Instant-NSR-main3/nerf/provider.py", line 106, in __init__
    raise RuntimeError('Failed to load focal length, please check the transforms.json!')
RuntimeError: Failed to load focal length, please check the transforms.json!
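(Editor's note: instant-ngp-style transforms.json files usually carry either explicit focal lengths (fl_x/fl_y) or a field-of-view angle (camera_angle_x), and the missing one can be derived from the other via the pinhole model. The sketch below shows that fallback; the function name and fallback order are illustrative, not the repo's actual loader.)

```python
import json
import math

def focal_from_transforms(transforms_path, image_width):
    """Recover a pixel focal length from a transforms.json that lacks an
    explicit 'fl_x', falling back to 'camera_angle_x' (instant-ngp keys).
    Hypothetical helper, not the Instant-NSR loader itself."""
    with open(transforms_path) as f:
        meta = json.load(f)
    if "fl_x" in meta:
        return float(meta["fl_x"])
    if "camera_angle_x" in meta:
        # Pinhole model: tan(fov_x / 2) = (W / 2) / f  =>  f = W / (2 tan(fov_x / 2))
        return 0.5 * image_width / math.tan(0.5 * float(meta["camera_angle_x"]))
    raise RuntimeError("Failed to load focal length, please check the transforms.json!")
```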

So I tested the Instant-NSR code on custom data in COLMAP format, but I get pure white rendered images.
Here are part of logs:
loss=0.0319 (0.0734), s_val=14.95, lr=0.000496: : 100% 64/64 [00:02<00:00, 22.04it/s]
==> Finished Epoch 1.
==> Start Training Epoch 2, lr=0.000496 ...
[density grid] min=0.000000, max=0.000000, mean=0.000000 | [step counter] mean=0 | [SDF] inv_s=512.0000
loss=0.0630 (0.0589), s_val=11.08, lr=0.000493: : 100% 64/64 [00:01<00:00, 39.11it/s]
==> Finished Epoch 2

Thanks a lot!

@zhaofuq
Owner

zhaofuq commented Sep 27, 2022

Hi, we have updated our data loader. Now you can test our code on the example dataset.

@nsl2014fm
Author

Hi, we have updated our data loader. Now you can test our code on the example dataset.

Thanks for your reply. I have successfully run the code on the example data.
However, I still get pure white rendered images, as follows:

[attached image: all-white rendering]

There must be something wrong. I only changed the learning rate from 1e-2 to 1e-5 because of NaN loss; the other parameters are the official defaults. Here is part of the training log:

==> Start Training Epoch 199, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=27 | [SDF] inv_s=512.0000
loss=0.0045 (0.0131), s_val=1.11, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.89it/s]
==> Finished Epoch 199.
==> Start Training Epoch 200, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=23 | [SDF] inv_s=512.0000
loss=0.0236 (0.0120), s_val=1.10, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.24it/s]
==> Finished Epoch 200.
++> Evaluate at epoch 200 ...
loss=0.0158 (0.0158): : 100% 1/1 [00:00<00:00,  9.05it/s]
++> Evaluate epoch 200 Finished.
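(Editor's note: dropping the learning rate three orders of magnitude to dodge NaNs usually stalls training, which is consistent with the density grid staying at zero above. A gentler alternative is to keep the original lr and clip gradients, unscaling first so the clip threshold applies to true gradient magnitudes. This is a generic PyTorch AMP sketch with a toy model, not the repo's trainer.)

```python
import torch

# Toy stand-ins for the real network and optimizer (hypothetical).
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

def train_step(x, y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # bring grads back to real scale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)      # GradScaler skips the step if grads are inf/NaN
    scaler.update()
    return loss.item()
```

With clipping plus GradScaler's built-in inf/NaN step-skipping, occasional bad batches no longer force a tiny global learning rate.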

@zhaofuq
Owner

zhaofuq commented Sep 28, 2022

Our code does not support "--cuda_ray" option by now. You may need to run our code using "CUDA_VISIBLE_DEVICES=0 python train_nerf.py INPUT --workspace OUTPUT --downscale 2 --network sdf" instead.

@ZirongChan

Hello, thanks for your great work @zhaofuq.
I have the same problem as @nsl2014fm, except that in my case the issue occurred when using the TCNN network.
Did you resolve the problem @nsl2014fm?
When using the SDF network it performed OK. Did this happen to you before? @zhaofuq

@zoezhu

zoezhu commented Dec 6, 2022

Thanks for your great work too! @zhaofuq But I hit an error when using --mode tcnn; can you point out where I went wrong?
Can you run --mode tcnn successfully? @ZirongChan
When I use --mode tcnn, I get the error below. Do you have any idea how to fix it? Thanks!

mycomputer:~/Instant-NSR$ CUDA_VISIBLE_DEVICES=0 python train_nerf.py my_data/bitong_cut/ --workspace test_tcnn --network tcnn
Namespace(bound=1, cuda_ray=False, curvature_loss=False, downscale=1, epoch=200, eval_iter=5, format='colmap', max_ray_batch=4096, mode='train', network='tcnn', num_rays=4096, num_steps=64, path='my_data/bitong_cut/', seed=0, upsample_steps=64, workspace='test_tcnn')
[INFO] Trainer: ngp | 2022-12-06_05-36-22 | cuda:0 | fp32 | test_tcnn
[INFO] #parameters: 12207505
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> Start Training Epoch 1, lr=0.010000 ...
/myhome/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "train_nerf.py", line 120, in <module>
    trainer.train(train_loader, valid_loader, opt.epoch)
  File "/myhome/Instant-NSR/nerf/utils.py", line 438, in train
    self.train_one_epoch(train_loader)
  File "/myhome/Instant-NSR/nerf/utils.py", line 638, in train_one_epoch
    self.scaler.scale(loss).backward()
  File "/myhome/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/myhome/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/myhome/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/myhome/lib/python3.8/site-packages/tinycudann/modules.py", line 112, in backward
    doutput_grad, params_grad, input_grad = ctx.ctx_fwd.native_tcnn_module.bwd_bwd_input(
RuntimeError: DifferentiableObject::backward_backward_input_impl: not implemented error

I use pytorch 1.10.1+cu111, with tinycudann 1.6
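(Editor's note: the traceback's `backward_backward_input_impl: not implemented` comes from tiny-cuda-nn's fused modules, which do not support second-order gradients (double backward); an SDF/eikonal-style loss needs the gradient of the SDF inside the loss, which triggers exactly that path. One common workaround is to estimate the SDF gradient by central finite differences, so only first-order autograd is ever required. The `sdf` callable below is a hypothetical stand-in for the network.)

```python
import torch

def finite_difference_normal(sdf, x, eps=1e-3):
    """Estimate grad(sdf) at points x of shape (N, 3) by central differences.
    Avoids double backward, which tiny-cuda-nn's fused kernels lack.
    `sdf` is any callable mapping (N, 3) -> (N, 1); hypothetical here."""
    offsets = eps * torch.eye(3, device=x.device)  # per-axis step vectors
    grads = []
    for i in range(3):
        d = offsets[i]
        grads.append((sdf(x + d) - sdf(x - d)) / (2 * eps))  # (N, 1) per axis
    return torch.cat(grads, dim=-1)  # (N, 3) estimated gradient
```

The estimated gradient can then feed the eikonal term (||grad|| close to 1) without ever differentiating through a backward pass.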

@1406428260


Same question @zoezhu, did you solve the problem?
