bennyguo/instant-nsr-pl

torch Runtime Error on DTU dataset

Opened this issue · 4 comments

Hello,
I am trying to train on the DTU dataset, but I ran into the same error as in issue 13. After applying the fix from that issue (i.e., changing all FullyFusedMLP to VanillaMLP in the config file), I get the following error from torch:

Global seed set to 42
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]

  | Name  | Type      | Params
------------------------------------
0 | model | NeuSModel | 25.2 M
------------------------------------
25.2 M    Trainable params
0         Non-trainable params
25.2 M    Total params
50.436    Total estimated model params size (MB)
Epoch 0: : 0it [00:00, ?it/s]Using /home/imc/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/imc/.cache/torch_extensions/py38_cu117/segment_cumsum_cuda/build.ninja...
Building extension module segment_cumsum_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module segment_cumsum_cuda...
Epoch 0: : 1it [00:05,  5.33s/it, loss=1.94, train/inv_s=20.10, train/num_rays=405.0]Traceback (most recent call last):
  File "launch.py", line 125, in <module>
    main()
  File "launch.py", line 114, in main
    trainer.fit(system, datamodule=dm)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 75, in optimizer_step
    closure_result = closure()
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
    step_output = self._step_fn()
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 351, in training_step
    return self.model(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 15 16 17 18
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Epoch 0: : 1it [00:06,  6.93s/it, loss=1.94, train/inv_s=20.10, train/num_rays=405.0]

Do you have an idea why this is happening? I ran the command as stated in the readme: python launch.py --config configs/neus-dtu.yaml --gpu 0 --train

I was able to train on the NeRF synthetic dataset (also using the fix from issue 13), and the results are really good!
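
For context, the traceback above ends with DDP's own suggested workaround: enabling unused-parameter detection. In PyTorch Lightning 1.9 that flag is normally passed through a DDPStrategy, as in the sketch below; the Trainer arguments shown are placeholders, not the actual ones used in launch.py.

# Sketch only: enabling unused-parameter detection for DDP in
# PyTorch Lightning 1.9. Trainer arguments are placeholders, not the
# values launch.py actually uses.
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# DDPStrategy forwards this keyword to torch.nn.parallel.DistributedDataParallel,
# so parameters that receive no gradient in a step no longer trigger the error above.
strategy = DDPStrategy(find_unused_parameters=True)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    strategy=strategy,
)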

Please list your environment, including your CUDA, PyTorch, Lightning, nerfacc, and tinycudann versions. Also, could you try replacing strategy=strategy in launch.py with strategy=None and see if it works?
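
A minimal sketch of the suggested change, assuming launch.py constructs the Trainer roughly along these lines (the other arguments are placeholders):

# Sketch only: drop the DDP strategy object and let Lightning fall back to
# its default single-process behaviour.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    strategy=None,  # was: strategy=strategy (a DDP strategy object)
)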

Changing the strategy to None worked, thanks!
Here is my environment for reference:

CUDA: 11.7
PyTorch: 2.0.1
Lightning: 1.9.5
Nerfacc: 0.3.3
tinycudann: 1.7
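
For anyone collecting the same information, here is a small sketch that prints these versions; the distribution names queried through importlib.metadata are assumptions about how nerfacc and tinycudann were installed.

# Sketch only: print the versions asked about above.
import torch
import pytorch_lightning as pl
from importlib.metadata import version, PackageNotFoundError

print("CUDA (torch build):", torch.version.cuda)
print("PyTorch:", torch.__version__)
print("Lightning:", pl.__version__)
for dist in ("nerfacc", "tinycudann"):  # assumed distribution names
    try:
        print(f"{dist}:", version(dist))
    except PackageNotFoundError:
        print(f"{dist}: not installed")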

Please list your environment, including your CUDA, PyTorch, Lightning, nerfacc, and tinycudann versions. Also, could you try replacing strategy=strategy in launch.py with strategy=None and see if it works?

This works for me! Thanks a lot!

It seems that VanillaMLP had some problems when using precision=32. It's fixed now, and you'll be fine without setting strategy=None.