lucidrains/byol-pytorch

Error when using 2 GPUs

superwj1990 opened this issue · 4 comments

Hi! When I try to run the PyTorch Lightning example code, I get the following error. Any idea how to fix this?

Epoch 0: 50%|#########################################################################5 | 1/2 [00:06<00:06, 6.06s/it, loss=3.94, v_num=10$
Traceback (most recent call last):
File "train.py", line 118, in
trainer.fit(model, train_loader)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 513, in fit
self.dispatch()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in dispatch
self.accelerator.start_training(self)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 74, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 111, in start_training
self._results = trainer.run_train()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 644, in run_train
self.train_loop.run_training_epoch()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 650, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
using_lbfgs=is_lbfgs,
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1384, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 219, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 135, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 278, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 283, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 160, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/optim/adam.py", line 66, in step
loss = closure()
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 645, in train_step_and_backward_closure
split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 293, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 157, in training_step
return self.training_type_plugin.training_step(*args)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 287, in training_step
return self.model(*args, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ssd1/wangjian_code/byol_pytorch/py3_env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 606, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
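
For reference (not something this thread confirms): suggestion (1) from the error message can be applied from the Lightning side by passing the DDP plugin explicitly. A minimal sketch, assuming pytorch-lightning 1.2.x, where the DDP plugin apparently no longer sets find_unused_parameters=True by default, and reusing the example script's model and train_loader:

import pytorch_lightning as pl
from pytorch_lightning.plugins import DDPPlugin

trainer = pl.Trainer(
    gpus=2,
    accelerator='ddp',
    # BYOL's momentum (target) encoder receives no gradients, so DDP must be
    # told to tolerate parameters that do not contribute to the loss.
    plugins=[DDPPlugin(find_unused_parameters=True)],
)
# trainer.fit(model, train_loader)  # model / train_loader as in the example script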

Same question here.

I guess you guys are using pytorch-lightning==1.2.1.
Downgrading to 1.1.8 may fix this problem.

Same question here.

When training this BYOL code with DistributedDataParallel (multiprocessing) on pytorch=1.7.0, the same error as above occurs.
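
Not a confirmed fix from this thread, but for a plain DistributedDataParallel + multiprocessing setup the same knob applies when wrapping the BYOL learner. A rough sketch, where train_worker, the address/port, and the spawn wiring are only placeholders:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank, world_size, learner):
    # one process per GPU, launched e.g. via torch.multiprocessing.spawn
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # BYOL keeps an EMA target network whose parameters get no gradient,
    # so DDP has to be allowed to skip unused parameters.
    ddp_learner = DDP(learner.cuda(rank), device_ids=[rank],
                      find_unused_parameters=True)
    # ...training loop: loss = ddp_learner(images); loss.backward(); optimizer.step()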

Same problem. Have you got it solved yet?