zhaofuq/Instant-NSR

Bad training performance with custom data

Hi, thanks for your great work.
However, I cannot reproduce the result with the test dataset [dance] provided in README.md, because some parameters are missing from transforms.json.
Traceback (most recent call last):
  File "train_nerf.py", line 107, in <module>
    train_dataset = NeRFDataset(opt.path, type='train', mode=opt.format, bound=opt.bound)
  File "Instant-NSR-main3/nerf/provider.py", line 106, in __init__
    raise RuntimeError('Failed to load focal length, please check the transforms.json!')
RuntimeError: Failed to load focal length, please check the transforms.json!
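
For anyone hitting the same error, here is a minimal sketch of checking whether a transforms.json carries enough information to derive a focal length. The key names `fl_x`, `camera_angle_x`, and `w` are assumptions taken from the instant-ngp-style convention; the repository's provider.py may read different keys, so treat this as a diagnostic aid rather than the loader's actual logic.

```python
import json
import math


def focal_from_transforms(meta, width):
    """Return a horizontal focal length in pixels, or None if it cannot be derived.

    Assumes instant-ngp-style keys (hypothetical for this repository).
    """
    if "fl_x" in meta:
        # Focal length stored directly, in pixels.
        return float(meta["fl_x"])
    if "camera_angle_x" in meta:
        # Horizontal field of view in radians -> focal length in pixels.
        return 0.5 * width / math.tan(0.5 * float(meta["camera_angle_x"]))
    return None


# In-memory stand-in for a transforms.json with a 90-degree horizontal FOV.
example = json.loads('{"camera_angle_x": 1.5707963267948966, "w": 800, "frames": []}')
focal = focal_from_transforms(example, example["w"])
print("focal:", focal)

# A file with neither key gives None, the case that would trigger the error above.
print("missing:", focal_from_transforms({"frames": []}, 800))
```

If the function returns None for your file, adding one of the assumed keys (or whichever keys the loader actually expects) is the likely fix.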

So I tested the Instant-NSR code on custom data in COLMAP format, but got pure white rendered images.
Here is part of the log:
loss=0.0319 (0.0734), s_val=14.95, lr=0.000496: : 100% 64/64 [00:02<00:00, 22.04it/s]
==> Finished Epoch 1.
==> Start Training Epoch 2, lr=0.000496 ...
[density grid] min=0.000000, max=0.000000, mean=0.000000 | [step counter] mean=0 | [SDF] inv_s=512.0000
loss=0.0630 (0.0589), s_val=11.08, lr=0.000493: : 100% 64/64 [00:01<00:00, 39.11it/s]
==> Finished Epoch 2

Thanks a lot!

Hi, we have updated our data loader. Now you can test our code on the example dataset.

Thanks for your reply. I have successfully run the code on the example data.
However, I still get pure white rendered images, as follows:

image

Something must be wrong. I only changed lr from 1e-2 to 1e-5 because of NaN loss; the other parameters are the official defaults. The following is part of the training log:

==> Start Training Epoch 199, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=27 | [SDF] inv_s=512.0000
loss=0.0045 (0.0131), s_val=1.11, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.89it/s]
==> Finished Epoch 199.
==> Start Training Epoch 200, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=23 | [SDF] inv_s=512.0000
loss=0.0236 (0.0120), s_val=1.10, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.24it/s]
==> Finished Epoch 200.
++> Evaluate at epoch 200 ...
loss=0.0158 (0.0158): : 100% 1/1 [00:00<00:00,  9.05it/s]
++> Evaluate epoch 200 Finished.
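
The telling symptom in the log above is that the density grid stays at zero for every epoch. As a rough sketch, a line like this can be checked automatically. The regular expression is written against the log format pasted in this thread only, and the helper name `density_grid_is_empty` is hypothetical.

```python
import re

# Matches the "[density grid] min=..., max=..., mean=..." fragment as it
# appears in the logs pasted in this thread.
DENSITY_RE = re.compile(
    r"\[density grid\] min=(?P<min>[\d.eE+-]+), "
    r"max=(?P<max>[\d.eE+-]+), mean=(?P<mean>[\d.eE+-]+)"
)


def density_grid_is_empty(log_line):
    """Return True if the logged max density is zero, None if the line does not match."""
    match = DENSITY_RE.search(log_line)
    if match is None:
        return None
    return float(match.group("max")) == 0.0


line = ("[density grid] min=0.0000, max=0.0000, mean=0.0000 | "
        "[step counter] mean=27 | [SDF] inv_s=512.0000")
empty = density_grid_is_empty(line)
print("density grid empty:", empty)
```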

Our code does not support the "--cuda_ray" option yet. You may need to run our code with "CUDA_VISIBLE_DEVICES=0 python train_nerf.py INPUT --workspace OUTPUT --downscale 2 --network sdf" instead.

Hello, thanks for your great work, @zhaofuq.
I have the same problem as @nsl2014fm, except that the issue occurred when I was using the TCNN network.
Did you resolve the problem, @nsl2014fm?
With the sdf network it performed OK. Has this happened to you before, @zhaofuq?

Thanks for your great work too, @zhaofuq! I now get an error when using --network tcnn. Can you point out what I got wrong?
Can you run --network tcnn successfully, @ZirongChan?
When I use --network tcnn, I get the following error. Do you have any idea how to fix it? Thanks!

mycomputer:~/Instant-NSR$ CUDA_VISIBLE_DEVICES=0 python train_nerf.py my_data/bitong_cut/ --workspace test_tcnn --network tcnn
Namespace(bound=1, cuda_ray=False, curvature_loss=False, downscale=1, epoch=200, eval_iter=5, format='colmap', max_ray_batch=4096, mode='train', network='tcnn', num_rays=4096, num_steps=64, path='my_data/bitong_cut/', seed=0, upsample_steps=64, workspace='test_tcnn')
[INFO] Trainer: ngp | 2022-12-06_05-36-22 | cuda:0 | fp32 | test_tcnn
[INFO] #parameters: 12207505
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> Start Training Epoch 1, lr=0.010000 ...
/myhome/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "train_nerf.py", line 120, in <module>
    trainer.train(train_loader, valid_loader, opt.epoch)
  File "/myhome/Instant-NSR/nerf/utils.py", line 438, in train
    self.train_one_epoch(train_loader)
  File "/myhome/Instant-NSR/nerf/utils.py", line 638, in train_one_epoch
    self.scaler.scale(loss).backward()
  File "/myhome/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/myhome/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/myhome/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/myhome/lib/python3.8/site-packages/tinycudann/modules.py", line 112, in backward
    doutput_grad, params_grad, input_grad = ctx.ctx_fwd.native_tcnn_module.bwd_bwd_input(
RuntimeError: DifferentiableObject::backward_backward_input_impl: not implemented error

I use PyTorch 1.10.1+cu111 with tinycudann 1.6.
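
One plausible reading of the traceback: the name `bwd_bwd_input` suggests the loss needs a backward pass through a backward pass (a second-order derivative with respect to the input), and the installed tinycudann module reports that this is not implemented. SDF training commonly needs this because an Eikonal-style term penalizes the gradient of the SDF. The sketch below reproduces that pattern in plain PyTorch, where double backward is supported. The network and loss are stand-ins and are not taken from the repository.

```python
import torch

torch.manual_seed(0)

# Stand-in SDF network in plain PyTorch (hypothetical, not the repository's model).
sdf_net = torch.nn.Sequential(
    torch.nn.Linear(3, 16),
    torch.nn.Softplus(),
    torch.nn.Linear(16, 1),
)

points = torch.rand(8, 3, requires_grad=True)
sdf = sdf_net(points)

# First-order gradient of the SDF with respect to the input points.
# create_graph=True keeps this gradient itself differentiable.
(grad,) = torch.autograd.grad(sdf.sum(), points, create_graph=True)

# Eikonal-style penalty: the gradient norm should be close to 1.
eikonal = ((grad.norm(dim=-1) - 1.0) ** 2).mean()

# This call differentiates through `grad`, i.e. a backward of a backward.
# A module without a double-backward implementation would fail at this step.
eikonal.backward()

has_second_order = all(p.grad is not None for p in sdf_net[0].parameters())
print("second-order gradients reached the first layer:", has_second_order)
```

If this reading is right, the error comes from the combination of loss and backend rather than from the dataset, which would match the report above that the sdf network trains without it.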

Same question. Did you solve the problem?