RUCAIBox/TextBox

[🐛BUG] Multi-GPU training with accelerate fails.


Describe the bug
Multi-GPU training with accelerate fails.

How to reproduce
accelerate launch run_textbox.py \
    --gpu_id=1,3 \
    --dataset=csl \
    --model=CPT \
    --model_path=fnlp/cpt-base \
    --saved_dir=./saved/ \
    --filename=DEBUG \
    --epochs=5 \
    --learning_rate=1e-5 \
    --train_batch_size=16 \
    --eval_batch_size=16 \
    --max_save=1 \
    --wandb=disabled \
    --quick_test=1000

Log
13 Feb 11:15 ERROR Traceback (most recent call last):
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
    yield True
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/quick_start/experiment.py", line 136, in run
    self._do_train_and_valid()
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
    self.valid_result = self.trainer.fit(train_data, valid_data)
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/trainer/trainer.py", line 451, in fit
    loss = self._train_epoch(train_data, epoch_idx, valid_data)['loss']
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/trainer/trainer.py", line 221, in _train_epoch
    loss = self.model(data, epoch_idx=epoch_idx)
  File "/home/cqy/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cqy/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
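For context, this is DDP's unused-parameter check firing, not anything TextBox-specific. A minimal standalone sketch that triggers the same RuntimeError (the TwoBranch module and the repro.py filename are made up for illustration; launch with e.g. torchrun --nproc_per_node=2 repro.py):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoBranch(torch.nn.Module):
    """Toy model with a branch whose parameters never contribute to the loss."""

    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(8, 8)
        self.unused = torch.nn.Linear(8, 8)  # never receives gradients

    def forward(self, x):
        return self.used(x)  # self.unused plays no part in the output


def main():
    dist.init_process_group("gloo")  # "nccl" for GPU training
    # With the default find_unused_parameters=False, the second iteration's
    # forward raises the same "Expected to have finished reduction ..." error;
    # DDP(TwoBranch(), find_unused_parameters=True) avoids it.
    model = DDP(TwoBranch())
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(2):
        opt.zero_grad()
        loss = model(torch.randn(4, 8)).sum()
        loss.backward()
        opt.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

find_unused_parameters=True makes DDP's reducer traverse the autograd graph each iteration and mark parameters that received no gradient as ready, at the cost of some per-iteration overhead.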

Running it directly on a single GPU with python run_textbox.py works fine.

Also, with the model switched to BART, multi-GPU training works; the problem only appears with CPT.

Thanks for the report. This is caused by CPT's architecture: the forward pass involves parameters that do not need updating, which breaks DDP. We have updated the code; you only need to add --find_unused_parameters=True to the command line.
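For reference, accelerate exposes extra DDP constructor arguments through kwargs handlers; a sketch of the wiring a flag like this presumably triggers (how TextBox plumbs it internally is an assumption on my part, but DistributedDataParallelKwargs and kwargs_handlers are accelerate's documented API):

```python
from accelerate import Accelerator, DistributedDataParallelKwargs

# Forward find_unused_parameters to torch.nn.parallel.DistributedDataParallel.
# Note: this is a sketch of the likely mechanism, not TextBox's actual code.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])
# accelerator.prepare(...) then wraps the model in DDP with this kwarg:
# model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
```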

After the update, it works fine on my end as well.