Bad training performance with custom data

Question

Opened this issue 2 years ago · 6 comments

nsl2014fm commented 2 years ago

Hi, thanks for your great work.
However, I cannot reproduce the result with the test dataset [dance] provided in README.md, because some parameters are missing from transforms.json.
Traceback (most recent call last):
  File "train_nerf.py", line 107, in <module>
    train_dataset = NeRFDataset(opt.path, type='train', mode=opt.format, bound=opt.bound)
  File "Instant-NSR-main3/nerf/provider.py", line 106, in __init__
    raise RuntimeError('Failed to load focal length, please check the transforms.json!')
RuntimeError: Failed to load focal length, please check the transforms.json!
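
The error above says the loader could not find a focal length in transforms.json. The following is a minimal sketch for checking the file before training. It assumes the instant-ngp-style keys `fl_x`, `fl_y`, `camera_angle_x`, and `camera_angle_y`; the keys that nerf/provider.py actually reads are not shown in this thread, and `focal_from_meta` is a hypothetical helper, not part of the repository.

```python
import math


def focal_from_meta(meta, width, height):
    """Return (fl_x, fl_y) in pixels, or None if no focal information is found.

    Assumption: instant-ngp-style keys. This is not the repository's loader.
    """
    fl_x = meta.get("fl_x")
    fl_y = meta.get("fl_y")
    # A field of view (in radians) can be converted to a focal length in pixels.
    if fl_x is None and "camera_angle_x" in meta:
        fl_x = 0.5 * width / math.tan(0.5 * meta["camera_angle_x"])
    if fl_y is None and "camera_angle_y" in meta:
        fl_y = 0.5 * height / math.tan(0.5 * meta["camera_angle_y"])
    if fl_x is None and fl_y is None:
        return None  # this is the situation the RuntimeError complains about
    # If only one axis is given, reuse it for the other (square pixels assumed).
    if fl_x is None:
        fl_x = fl_y
    if fl_y is None:
        fl_y = fl_x
    return fl_x, fl_y


# Example with an in-memory dictionary; a real file would be read with json.load.
example_meta = {"camera_angle_x": math.pi / 2, "frames": []}
print(focal_from_meta(example_meta, width=800, height=600))
```

If the helper returns None for your transforms.json, the file carries none of the assumed keys, which would match the error in the traceback.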

So I tested the Instant-NSR code on custom data in COLMAP format, but got pure white rendered images.
Here is part of the log:
loss=0.0319 (0.0734), s_val=14.95, lr=0.000496: : 100% 64/64 [00:02<00:00, 22.04it/s]
==> Finished Epoch 1.
==> Start Training Epoch 2, lr=0.000496 ...
[density grid] min=0.000000, max=0.000000, mean=0.000000 | [step counter] mean=0 | [SDF] inv_s=512.0000
loss=0.0630 (0.0589), s_val=11.08, lr=0.000493: : 100% 64/64 [00:01<00:00, 39.11it/s]
==> Finished Epoch 2

Thanks a lot!

Answer 1 · 2022-09-27T14:01:03.000Z

Hi, we have updated our data loader. Now you can test our code on the example dataset.

Answer 2 · 2022-09-28T06:54:59.000Z

Thanks for your reply. I have successfully run the code on the example data.
However, I still got pure white rendered images.

There must be something wrong. I only changed lr from 1e-2 to 1e-5 because of a NaN loss; the other parameters are left at the official defaults. The following is part of the training log:

==> Start Training Epoch 199, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=27 | [SDF] inv_s=512.0000
loss=0.0045 (0.0131), s_val=1.11, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.89it/s]
==> Finished Epoch 199.
==> Start Training Epoch 200, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=23 | [SDF] inv_s=512.0000
loss=0.0236 (0.0120), s_val=1.10, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.24it/s]
==> Finished Epoch 200.
++> Evaluate at epoch 200 ...
loss=0.0158 (0.0158): : 100% 1/1 [00:00<00:00,  9.05it/s]
++> Evaluate epoch 200 Finished.
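
In both logs the `[density grid]` line reports a min, max, and mean of zero. Whether this causes the white renderings is not established in this thread, but it is easy to flag automatically. The sketch below parses log lines in the format shown above; `density_grid_is_empty` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# [density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=27 | ...
DENSITY_RE = re.compile(
    r"\[density grid\] min=(?P<min>[\d.eE+-]+), "
    r"max=(?P<max>[\d.eE+-]+), mean=(?P<mean>[\d.eE+-]+)"
)


def density_grid_is_empty(log_text):
    """Return True if every logged density-grid max is zero.

    Returns None when the log contains no density-grid line at all.
    """
    maxima = [float(m.group("max")) for m in DENSITY_RE.finditer(log_text)]
    if not maxima:
        return None
    return all(value == 0.0 for value in maxima)


sample = (
    "==> Start Training Epoch 200, lr=0.000010 ...\n"
    "[density grid] min=0.0000, max=0.0000, mean=0.0000 | "
    "[step counter] mean=23 | [SDF] inv_s=512.0000\n"
)
print(density_grid_is_empty(sample))
```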

Answer 3 · 2022-09-28T12:39:39.000Z

Our code does not support the "--cuda_ray" option yet. You may need to run our code with "CUDA_VISIBLE_DEVICES=0 python train_nerf.py INPUT --workspace OUTPUT --downscale 2 --network sdf" instead.

Answer 4 · 2022-11-22T06:36:34.000Z

Hello, thanks for your great work, @zhaofuq.
I have the same problem as @nsl2014fm, except that it occurred when I was using the TCNN network.
Did you resolve the problem, @nsl2014fm?
With the sdf network it performed OK. Has this happened to you before, @zhaofuq?

Answer 5 · 2022-12-06T05:40:11.000Z

Thanks for your great work too, @zhaofuq! I now get an error when using --network tcnn. Can you point out what I did wrong?
Can you run --network tcnn successfully, @ZirongChan?
With --network tcnn I get the following error. Do you have any idea how to fix it? Thanks!

mycomputer:~/Instant-NSR$ CUDA_VISIBLE_DEVICES=0 python train_nerf.py my_data/bitong_cut/ --workspace test_tcnn --network tcnn
Namespace(bound=1, cuda_ray=False, curvature_loss=False, downscale=1, epoch=200, eval_iter=5, format='colmap', max_ray_batch=4096, mode='train', network='tcnn', num_rays=4096, num_steps=64, path='my_data/bitong_cut/', seed=0, upsample_steps=64, workspace='test_tcnn')
[INFO] Trainer: ngp | 2022-12-06_05-36-22 | cuda:0 | fp32 | test_tcnn
[INFO] #parameters: 12207505
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> Start Training Epoch 1, lr=0.010000 ...
/myhome/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "train_nerf.py", line 120, in <module>
    trainer.train(train_loader, valid_loader, opt.epoch)
  File "/myhome/Instant-NSR/nerf/utils.py", line 438, in train
    self.train_one_epoch(train_loader)
  File "/myhome/Instant-NSR/nerf/utils.py", line 638, in train_one_epoch
    self.scaler.scale(loss).backward()
  File "/myhome/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/myhome/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/myhome/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/myhome/lib/python3.8/site-packages/tinycudann/modules.py", line 112, in backward
    doutput_grad, params_grad, input_grad = ctx.ctx_fwd.native_tcnn_module.bwd_bwd_input(
RuntimeError: DifferentiableObject::backward_backward_input_impl: not implemented error

I use PyTorch 1.10.1+cu111 with tinycudann 1.6.
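
The last frame of the traceback is `bwd_bwd_input`, which suggests that the loss backpropagates through a gradient taken with respect to the network input (a second-order derivative), and that this tinycudann build does not implement that path. One generic way to avoid second-order autograd is to estimate the input gradient with central finite differences. The sketch below is illustrative only: it uses an analytic sphere SDF as a stand-in for the network, `sphere_sdf` and `finite_difference_gradient` are hypothetical names, and nothing here is taken from the Instant-NSR code.

```python
import math


def sphere_sdf(p, radius=0.5):
    """Signed distance from point p to a sphere centred at the origin."""
    x, y, z = p
    return math.sqrt(x * x + y * y + z * z) - radius


def finite_difference_gradient(sdf, p, eps=1e-4):
    """Estimate the gradient of sdf at p with central differences.

    Only forward evaluations of sdf are needed, so no second-order
    derivative of the underlying model is required.
    """
    grad = []
    for axis in range(3):
        hi = list(p)
        lo = list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * eps))
    return grad


point = (0.3, -0.2, 0.6)
normal = finite_difference_gradient(sphere_sdf, point)
norm = math.sqrt(sum(g * g for g in normal))
# For a true SDF the gradient norm is close to 1 (the eikonal property).
print(normal, norm)
```

The trade-off is two extra forward evaluations per axis per sample and an approximation error that depends on `eps`.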

Answer 6 · 2023-02-24T08:12:02.000Z

Same question. Did you solve the problem?