Turoad/CLRNet

Division by zero during multi-GPU training

Closed this issue · 3 comments

Hello, and thank you very much for your excellent work.
During training I found that single-GPU training runs fine, but multi-GPU training raises a division-by-zero error. I have read through this part of the code and cannot find any logic error. Could you help me figure out what is going wrong?
Thank you very much!
2022-10-25 09:26:02,847 - clrnet.utils.recorder - INFO - epoch: 0 step: 1 lr: 0.000600 loss: 12.9223 cls_loss: 5.2943 reg_xytl_loss: 3.2206 seg_loss: 2.6702 iou_loss: 1.7371 stage_0_acc: 93.5330 stage_1_acc: 95.6814 stage_2_acc: 96.4410 stage_3_acc: 96.3216 data: 2.2079 batch: 17.4761 eta: 22 days, 11:16:44
Traceback (most recent call last):
File "main.py", line 75, in
main()
File "main.py", line 39, in main
runner.train()
File "/opt/data/private/Algorithm/lane-detection/clrnet/clrnet/engine/runner.py", line 101, in train
self.train_epoch(epoch, train_loader, tb_writer)
File "/opt/data/private/Algorithm/lane-detection/clrnet/clrnet/engine/runner.py", line 63, in train_epoch
output = self.net(data)
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 42, in forward
return super().forward(*inputs, **kwargs)
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
ZeroDivisionError: Caught ZeroDivisionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/data/private/Algorithm/lane-detection/clrnet/clrnet/models/nets/detector.py", line 33, in forward
output = self.heads(fea, batch=batch)
File "/opt/data/private/conda_env/envs/clrnet/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/data/private/Algorithm/lane-detection/clrnet/clrnet/models/heads/clr_head.py", line 275, in forward
return self.loss(output, kwargs['batch'])
File "/opt/data/private/Algorithm/lane-detection/clrnet/clrnet/models/heads/clr_head.py", line 454, in loss
cls_acc.append(sum(cls_acc_stage) / len(cls_acc_stage))
ZeroDivisionError: division by zero

Maybe you are using a small batch size, and none of the samples in this batch contain positive samples.
You can try:
cls_acc.append(sum(cls_acc_stage) / (len(cls_acc_stage))+1e-6).
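For reference, a minimal guard sketch of the failing line (the names mirror clr_head.py's loss(), but this is an illustration rather than the project's exact code): only average cls_acc_stage when it is non-empty, since it can stay empty when a replica's shard of the batch contains no positive samples.

# Hedged sketch: avoid averaging an empty per-stage accuracy list.
# cls_acc_stage appears to collect per-sample accuracies for the current stage
# and stays empty when this replica's shard has no positive samples.
if cls_acc_stage:
    cls_acc.append(sum(cls_acc_stage) / len(cls_acc_stage))
else:
    cls_acc.append(0.0)  # log 0 for the degenerate batch instead of crashing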

Hello, thanks for your great work. When dealing with a small batch size, cls_acc.append(sum(cls_acc_stage) / (len(cls_acc_stage))+1e-6) still does not work, because the +1e-6 is added outside the division, so it still divides by zero; I think you meant cls_acc.append((sum(cls_acc_stage) + 1e-6) / (len(cls_acc_stage) + 1))
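A quick numeric check of that expression on assumed inputs (the list lengths below are made up, just to show the behaviour): with an empty list it returns about 1e-6 instead of raising, and with a non-empty list of length n it returns the mean scaled by roughly n / (n + 1), so the logged accuracy sits below the true mean.

# Hedged check of (sum + 1e-6) / (len + 1) on assumed per-stage accuracy lists.
for stage_vals in ([], [0.97, 0.96], [0.97, 0.96, 0.98, 0.95]):
    n = len(stage_vals)
    patched = (sum(stage_vals) + 1e-6) / (n + 1)   # the proposed fix
    unbiased = sum(stage_vals) / n if n else 0.0   # plain mean, 0.0 when empty
    print(n, round(patched, 4), round(unbiased, 4))
# prints roughly: 0 0.0 0.0 / 2 0.6433 0.965 / 4 0.772 0.965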

Hello, thank you for your answer. I modified the code accordingly, and with batch=4 the ZeroDivisionError: division by zero no longer occurs. However, the stage_0_acc, stage_1_acc, and stage_2_acc values during training have dropped from around 97% to around 74%/85%/88%. What impact will this have?
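That drop is most likely a reporting artifact of the patched expression rather than a real change in the model: dividing by len(cls_acc_stage) + 1 scales each logged stage accuracy by roughly n / (n + 1), where n is the number of entries in cls_acc_stage, and with a small per-replica batch n is small, so the factor is far from 1. As far as I can tell from clr_head.py, cls_acc appears to be used only for logging and is not part of the training loss, so optimisation itself should be unaffected. If you want numbers comparable to single-GPU runs, a guard keeps the metric unbiased (a sketch reusing the same names, not the exact project code):

# Hedged sketch: report the unbiased mean, and 0.0 only for the empty case.
cls_acc.append(sum(cls_acc_stage) / len(cls_acc_stage) if cls_acc_stage else 0.0)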