torch Runtime Error on DTU dataset
Opened this issue · 4 comments
Hello,
I am trying to train on the DTU dataset, but I ran into the same error as in issue 13. After applying the fix from that issue (i.e., changing all FullyFusedMLP to VanillaMLP in the config file), I now get the following error from torch:
Global seed set to 42
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1.0)` was configured so 100% of the batches per epoch will be used..
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1]
| Name | Type | Params
------------------------------------
0 | model | NeuSModel | 25.2 M
------------------------------------
25.2 M Trainable params
0 Non-trainable params
25.2 M Total params
50.436 Total estimated model params size (MB)
Epoch 0: : 0it [00:00, ?it/s]Using /home/imc/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/imc/.cache/torch_extensions/py38_cu117/segment_cumsum_cuda/build.ninja...
Building extension module segment_cumsum_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module segment_cumsum_cuda...
Epoch 0: : 1it [00:05, 5.33s/it, loss=1.94, train/inv_s=20.10, train/num_rays=405.0]Traceback (most recent call last):
File "launch.py", line 125, in <module>
main()
File "launch.py", line 114, in main
trainer.fit(system, datamodule=dm)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
results = self._run_stage()
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
self._run_train()
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
self.fit_loop.run()
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/core/module.py", line 1754, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 280, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 234, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 75, in optimizer_step
closure_result = closure()
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
step_output = self._step_fn()
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 351, in training_step
return self.model(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/imc/miniconda3/envs/insrpl/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 15 16 17 18
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Epoch 0: : 1it [00:06, 6.93s/it, loss=1.94, train/inv_s=20.10, train/num_rays=405.0]
Do you have an idea why this is happening? I ran the command as stated in the README: `python launch.py --config configs/neus-dtu.yaml --gpu 0 --train`
I was able to train on the NeRF synthetic dataset (also using the fix from issue 13) and the results are really good!
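For completeness, the error message above suggests enabling unused-parameter detection. I assume that would look roughly like the sketch below in Lightning (untested on my side; the actual Trainer setup in launch.py may differ):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Untested sketch: let DDP tolerate parameters that receive no gradient,
# as the RuntimeError above suggests. Setting TORCH_DISTRIBUTED_DEBUG=DETAIL
# in the environment would additionally show which parameters are affected.
strategy = DDPStrategy(find_unused_parameters=True)
trainer = Trainer(accelerator="gpu", devices=1, strategy=strategy)
```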
Please list your environment, including your CUDA, PyTorch, Lightning, nerfacc, and tinycudann versions. Also, could you try replacing `strategy=strategy` in `launch.py` with `strategy=None` and see if it works?
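Concretely, the change I'm suggesting looks roughly like this (a simplified sketch; the other Trainer arguments in launch.py are omitted here):

```python
import pytorch_lightning as pl

# Simplified sketch of the Trainer construction in launch.py.
# strategy=None makes Lightning fall back to a single-process strategy,
# so the model is not wrapped in DistributedDataParallel and the
# unused-parameter check that raises the error above never runs.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision=16,
    strategy=None,  # was: strategy=strategy
)
```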
Changing the strategy to None worked, thanks!
Here is my environment for reference:
CUDA: 11.7
PyTorch: 2.0.1
Lightning: 1.9.5
nerfacc: 0.3.3
tinycudann: 1.7
> Please list your environment, including your CUDA, PyTorch, Lightning, nerfacc, and tinycudann versions. Also, could you try replacing `strategy=strategy` in `launch.py` with `strategy=None` and see if it works?
This works for me! Thanks a lot!
It seems that VanillaMLP has some problems when using `precision=32`. It's fixed now, and you'll be fine without setting `strategy=None`.