Bad training performance with custom data #2

Open
nsl2014fm opened this issue Sep 27, 2022 · 6 comments

Comments

@nsl2014fm

Hi, thanks for your great work.
However, I cannot reproduce the result with the test dataset [dance] you provided in README.md, because transforms.json is missing some parameters:
Traceback (most recent call last):
  File "train_nerf.py", line 107, in <module>
    train_dataset = NeRFDataset(opt.path, type='train', mode=opt.format, bound=opt.bound)
  File "Instant-NSR-main3/nerf/provider.py", line 106, in __init__
    raise RuntimeError('Failed to load focal length, please check the transforms.json!')
RuntimeError: Failed to load focal length, please check the transforms.json!
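(Editor's note: instant-ngp-style transforms.json files usually carry either explicit focal lengths (fl_x/fl_y) or a field-of-view angle (camera_angle_x), and the missing one can be derived from the other via the pinhole model. The sketch below shows that fallback; the function name and fallback order are illustrative, not the repo's actual loader.)

```python
import json
import math

def focal_from_transforms(transforms_path, image_width):
    """Recover a pixel focal length from a transforms.json that lacks an
    explicit 'fl_x', falling back to 'camera_angle_x' (instant-ngp keys).
    Hypothetical helper, not the Instant-NSR loader itself."""
    with open(transforms_path) as f:
        meta = json.load(f)
    if "fl_x" in meta:
        return float(meta["fl_x"])
    if "camera_angle_x" in meta:
        # Pinhole model: tan(fov_x / 2) = (W / 2) / f  =>  f = W / (2 tan(fov_x / 2))
        return 0.5 * image_width / math.tan(0.5 * float(meta["camera_angle_x"]))
    raise RuntimeError("Failed to load focal length, please check the transforms.json!")
```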

So I tested the Instant-NSR code on custom data in COLMAP format, but I get pure white rendered images.
Here are part of logs:
loss=0.0319 (0.0734), s_val=14.95, lr=0.000496: : 100% 64/64 [00:02<00:00, 22.04it/s]
==> Finished Epoch 1.
==> Start Training Epoch 2, lr=0.000496 ...
[density grid] min=0.000000, max=0.000000, mean=0.000000 | [step counter] mean=0 | [SDF] inv_s=512.0000
loss=0.0630 (0.0589), s_val=11.08, lr=0.000493: : 100% 64/64 [00:01<00:00, 39.11it/s]
==> Finished Epoch 2

Thanks a lot!

@zhaofuq
Owner

zhaofuq commented Sep 27, 2022

Hi, we have updated our data loader. Now you can test our code on the example dataset.

@nsl2014fm
Author

Hi, we have updated our data loader. Now you can test our code on the example dataset.

Thanks for your reply. I have successfully run the code on the example data.
However, I still get pure white rendered images, as follows:

[attached image: all-white rendering]

There must be something wrong. I only changed the learning rate from 1e-2 to 1e-5 because of NaN loss; the other parameters are the official defaults. Here is part of the training log:

==> Start Training Epoch 199, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=27 | [SDF] inv_s=512.0000
loss=0.0045 (0.0131), s_val=1.11, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.89it/s]
==> Finished Epoch 199.
==> Start Training Epoch 200, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=23 | [SDF] inv_s=512.0000
loss=0.0236 (0.0120), s_val=1.10, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.24it/s]
==> Finished Epoch 200.
++> Evaluate at epoch 200 ...
loss=0.0158 (0.0158): : 100% 1/1 [00:00<00:00,  9.05it/s]
++> Evaluate epoch 200 Finished.
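(Editor's note: dropping the learning rate three orders of magnitude to dodge NaNs usually stalls training, which is consistent with the density grid staying at zero above. A gentler alternative is to keep the original lr and clip gradients, unscaling first so the clip threshold applies to true gradient magnitudes. This is a generic PyTorch AMP sketch with a toy model, not the repo's trainer.)

```python
import torch

# Toy stand-ins for the real network and optimizer (hypothetical).
model = torch.nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

def train_step(x, y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # bring grads back to real scale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)      # GradScaler skips the step if grads are inf/NaN
    scaler.update()
    return loss.item()
```

With clipping plus GradScaler's built-in inf/NaN step-skipping, occasional bad batches no longer force a tiny global learning rate.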

@zhaofuq
Owner

zhaofuq commented Sep 28, 2022

Our code does not support "--cuda_ray" option by now. You may need to run our code using "CUDA_VISIBLE_DEVICES=0 python train_nerf.py INPUT --workspace OUTPUT --downscale 2 --network sdf" instead.

@ZirongChan

Hello, thanks for your great work @zhaofuq.
I have the same problem as @nsl2014fm, except that in my case the issue occurred when using the TCNN network.
Did you resolve the problem @nsl2014fm?
When using the SDF network it performed OK. Did this happen to you before? @zhaofuq

@zoezhu

zoezhu commented Dec 6, 2022

Thanks for your great work too! @zhaofuq But I hit an error when using --mode tcnn; can you point out where I went wrong?
Can you run --mode tcnn successfully? @ZirongChan
When I use --mode tcnn, I get the error below. Do you have any idea how to fix it? Thanks!

mycomputer:~/Instant-NSR$ CUDA_VISIBLE_DEVICES=0 python train_nerf.py my_data/bitong_cut/ --workspace test_tcnn --network tcnn
Namespace(bound=1, cuda_ray=False, curvature_loss=False, downscale=1, epoch=200, eval_iter=5, format='colmap', max_ray_batch=4096, mode='train', network='tcnn', num_rays=4096, num_steps=64, path='my_data/bitong_cut/', seed=0, upsample_steps=64, workspace='test_tcnn')
[INFO] Trainer: ngp | 2022-12-06_05-36-22 | cuda:0 | fp32 | test_tcnn
[INFO] #parameters: 12207505
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> Start Training Epoch 1, lr=0.010000 ...
/myhome/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "train_nerf.py", line 120, in <module>
    trainer.train(train_loader, valid_loader, opt.epoch)
  File "/myhome/Instant-NSR/nerf/utils.py", line 438, in train
    self.train_one_epoch(train_loader)
  File "/myhome/Instant-NSR/nerf/utils.py", line 638, in train_one_epoch
    self.scaler.scale(loss).backward()
  File "/myhome/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/myhome/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/myhome/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/myhome/lib/python3.8/site-packages/tinycudann/modules.py", line 112, in backward
    doutput_grad, params_grad, input_grad = ctx.ctx_fwd.native_tcnn_module.bwd_bwd_input(
RuntimeError: DifferentiableObject::backward_backward_input_impl: not implemented error

I use pytorch 1.10.1+cu111, with tinycudann 1.6
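(Editor's note: the traceback's `backward_backward_input_impl: not implemented` comes from tiny-cuda-nn's fused modules, which do not support second-order gradients (double backward); an SDF/eikonal-style loss needs the gradient of the SDF inside the loss, which triggers exactly that path. One common workaround is to estimate the SDF gradient by central finite differences, so only first-order autograd is ever required. The `sdf` callable below is a hypothetical stand-in for the network.)

```python
import torch

def finite_difference_normal(sdf, x, eps=1e-3):
    """Estimate grad(sdf) at points x of shape (N, 3) by central differences.
    Avoids double backward, which tiny-cuda-nn's fused kernels lack.
    `sdf` is any callable mapping (N, 3) -> (N, 1); hypothetical here."""
    offsets = eps * torch.eye(3, device=x.device)  # per-axis step vectors
    grads = []
    for i in range(3):
        d = offsets[i]
        grads.append((sdf(x + d) - sdf(x - d)) / (2 * eps))  # (N, 1) per axis
    return torch.cat(grads, dim=-1)  # (N, 3) estimated gradient
```

The estimated gradient can then feed the eikonal term (||grad|| close to 1) without ever differentiating through a backward pass.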

@1406428260


Same question @zoezhu, did you solve the problem?
