Bad training performance with custom data
Hi, thanks for your great work.
However, I cannot reproduce the results on the test dataset [dance] that you provide in README.md, because some parameters are missing from transforms.json.
Traceback (most recent call last):
File "train_nerf.py", line 107, in <module>
train_dataset = NeRFDataset(opt.path, type='train', mode=opt.format, bound=opt.bound)
File "Instant-NSR-main3/nerf/provider.py", line 106, in __init__
raise RuntimeError('Failed to load focal length, please check the transforms.json!')
RuntimeError: Failed to load focal length, please check the transforms.json!
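For anyone hitting the same error, a quick check of transforms.json can show whether any focal-length information is present at all. The sketch below assumes the loader looks for the keys commonly used in instant-ngp/NeRF-style transforms.json files (`fl_x`, `fl_y`, `camera_angle_x`); those key names and the helper `focal_from_transforms` are assumptions for illustration, not taken from provider.py.

```python
import json
import math


def focal_from_transforms(transform: dict, width: float):
    """Return a focal length in pixels, or None if no known key is present.

    Assumed keys (instant-ngp / NeRF-style convention, not confirmed for
    this repository): fl_x, fl_y, camera_angle_x.
    """
    if "fl_x" in transform:
        return float(transform["fl_x"])
    if "fl_y" in transform:
        return float(transform["fl_y"])
    if "camera_angle_x" in transform:
        # Horizontal field of view in radians -> focal length in pixels.
        angle = float(transform["camera_angle_x"])
        return width / (2.0 * math.tan(angle / 2.0))
    return None


def check_file(path: str, width: float):
    """Load a transforms.json file and report the focal length, if any."""
    with open(path, "r", encoding="utf-8") as f:
        transform = json.load(f)
    return focal_from_transforms(transform, width)
```

If `check_file` returns None for your file, the file likely lacks the intrinsics the loader needs, which would be consistent with the RuntimeError above.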
So I tested the Instant-NSR code on custom data in COLMAP format, but the rendered images are pure white.
Here is part of the log:
loss=0.0319 (0.0734), s_val=14.95, lr=0.000496: : 100% 64/64 [00:02<00:00, 22.04it/s]
==> Finished Epoch 1.
==> Start Training Epoch 2, lr=0.000496 ...
[density grid] min=0.000000, max=0.000000, mean=0.000000 | [step counter] mean=0 | [SDF] inv_s=512.0000
loss=0.0630 (0.0589), s_val=11.08, lr=0.000493: : 100% 64/64 [00:01<00:00, 39.11it/s]
==> Finished Epoch 2
Thanks a lot!
Hi, we have updated our data loader. Now you can test our code on the example dataset.
Thanks for your reply. I have successfully run the code on the example data.
However, I still get pure white rendered images.
There must be something wrong. I only changed lr from 1e-2 to 1e-5 because of a NaN loss; the other parameters are the official defaults. The following is part of the training log:
==> Start Training Epoch 199, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=27 | [SDF] inv_s=512.0000
loss=0.0045 (0.0131), s_val=1.11, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.89it/s]
==> Finished Epoch 199.
==> Start Training Epoch 200, lr=0.000010 ...
[density grid] min=0.0000, max=0.0000, mean=0.0000 | [step counter] mean=23 | [SDF] inv_s=512.0000
loss=0.0236 (0.0120), s_val=1.10, lr=0.000010: : 100% 70/70 [00:01<00:00, 47.24it/s]
==> Finished Epoch 200.
++> Evaluate at epoch 200 ...
loss=0.0158 (0.0158): : 100% 1/1 [00:00<00:00, 9.05it/s]
++> Evaluate epoch 200 Finished.
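One detail in these logs is that the `[density grid]` line reports min, max, and mean all equal to zero in every epoch shown. Whether that is the cause of the white renders is not established in this thread, but it is easy to scan a log for. The sketch below parses lines in the format shown above; the helper names are hypothetical.

```python
import re

# Matches the "[density grid] min=..., max=..., mean=..." part of a log line.
DENSITY_RE = re.compile(
    r"\[density grid\]\s*min=(?P<min>[-\d.eE+]+),"
    r"\s*max=(?P<max>[-\d.eE+]+),"
    r"\s*mean=(?P<mean>[-\d.eE+]+)"
)


def density_grid_stats(line: str):
    """Return {'min', 'max', 'mean'} as floats, or None if the line has no stats."""
    match = DENSITY_RE.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


def grid_is_empty(line: str) -> bool:
    """True when the line reports a density grid whose maximum is exactly zero."""
    stats = density_grid_stats(line)
    return stats is not None and stats["max"] == 0.0


sample = (
    "[density grid] min=0.0000, max=0.0000, mean=0.0000 | "
    "[step counter] mean=27 | [SDF] inv_s=512.0000"
)
empty = grid_is_empty(sample)
```

Running `grid_is_empty` over every line of a training log gives a quick way to see whether the grid ever becomes non-zero during training.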
Our code does not support the "--cuda_ray" option yet. You may need to run our code with "CUDA_VISIBLE_DEVICES=0 python train_nerf.py INPUT --workspace OUTPUT --downscale 2 --network sdf" instead.
Hello, thanks for your great work @zhaofuq.
I have the same problem as @nsl2014fm, except that it occurred when I was using the TCNN network.
Did you resolve the problem, @nsl2014fm?
With the sdf network it performed OK. Has this happened to you before, @zhaofuq?
Thanks for your great work too, @zhaofuq! I now get an error when using --network tcnn. Can you point out what I am doing wrong?
Can you run --network tcnn successfully, @ZirongChan?
When I use --network tcnn, I get the following error. Do you have any idea how to fix it? Thanks!
mycomputer:~/Instant-NSR$ CUDA_VISIBLE_DEVICES=0 python train_nerf.py my_data/bitong_cut/ --workspace test_tcnn --network tcnn
Namespace(bound=1, cuda_ray=False, curvature_loss=False, downscale=1, epoch=200, eval_iter=5, format='colmap', max_ray_batch=4096, mode='train', network='tcnn', num_rays=4096, num_steps=64, path='my_data/bitong_cut/', seed=0, upsample_steps=64, workspace='test_tcnn')
[INFO] Trainer: ngp | 2022-12-06_05-36-22 | cuda:0 | fp32 | test_tcnn
[INFO] #parameters: 12207505
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> Start Training Epoch 1, lr=0.010000 ...
/myhome/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Traceback (most recent call last):
File "train_nerf.py", line 120, in <module>
trainer.train(train_loader, valid_loader, opt.epoch)
File "/myhome/Instant-NSR/nerf/utils.py", line 438, in train
self.train_one_epoch(train_loader)
File "/myhome/Instant-NSR/nerf/utils.py", line 638, in train_one_epoch
self.scaler.scale(loss).backward()
File "/myhome/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/myhome/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
File "/myhome/lib/python3.8/site-packages/torch/autograd/function.py", line 199, in apply
return user_fn(self, *args)
File "/myhome/lib/python3.8/site-packages/tinycudann/modules.py", line 112, in backward
doutput_grad, params_grad, input_grad = ctx.ctx_fwd.native_tcnn_module.bwd_bwd_input(
RuntimeError: DifferentiableObject::backward_backward_input_impl: not implemented error
I am using PyTorch 1.10.1+cu111 with tinycudann 1.6.
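The last frames of the traceback (`bwd_bwd_input`, `backward_backward_input_impl`) indicate that the failure happens in a second-order backward pass: PyTorch is asking the tinycudann module to differentiate its own backward with respect to the input, and that path reports "not implemented". This pattern typically arises when the loss depends on the gradient of the network output with respect to its input, as with an eikonal-style term on an SDF. The thread does not show the loss code, so the following is only a sketch under that assumption. It uses plain PyTorch layers, which do support double backward, and the function name `eikonal_loss` is hypothetical.

```python
import torch
import torch.nn as nn


def eikonal_loss(sdf_net: nn.Module, points: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the SDF gradient norm from 1 at the given points."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_net(points)
    # First-order gradient of the SDF w.r.t. the input points.
    # create_graph=True keeps the graph so the loss can be backpropagated
    # through this gradient (the "double backward" step).
    (grad,) = torch.autograd.grad(
        outputs=sdf,
        inputs=points,
        grad_outputs=torch.ones_like(sdf),
        create_graph=True,
    )
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()


torch.manual_seed(0)
# Small stand-in MLP; a smooth activation gives non-trivial second derivatives.
net = nn.Sequential(nn.Linear(3, 16), nn.Softplus(), nn.Linear(16, 1))
pts = torch.rand(8, 3)

loss = eikonal_loss(net, pts)
loss.backward()  # needs second-order support from every layer in the network
```

If the same call pattern is applied to a network whose backward does not itself implement a backward, `loss.backward()` fails in the way the traceback shows, which would be consistent with the sdf network working while tcnn does not.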
Same question. Did you solve the problem?