lxtGH/SFSegNets

Error: CUDA out of memory

daixiaolei623 opened this issue · 16 comments

@lxtGH
Thank you for your work.
When I train sfnet_res101 on my own data, I get the error: CUDA out of memory. My training images are 2710x3384, and I have two GTX 1080Ti cards with about 11 GB of memory each.
Could you please tell me how to solve this?
Thank you.

lxtGH commented

@daixiaolei623 Hi! Please change your batch size to 8.

@lxtGH
Thank you. Do you mean the bs_mult=2 in train_cityscapes_sfnet_res101.sh? If not, could you please tell me where the batch size is set in your code?

@lxtGH
Do you mean I should change 'bs_mult=2' to 'bs_mult=8' in train_cityscapes_sfnet_res101.sh?
Thank you.

lxtGH commented

Change it to 'bs_mult=1', or lower the crop size.
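
For reference, bs_mult here appears to be the per-GPU batch size, so the effective global batch is bs_mult × nproc_per_node. A minimal sketch of how such a flag would feed each process's DataLoader (the wiring below is an illustrative assumption, not the repo's actual train.py):

```python
# Illustrative sketch (assumed wiring, not the repo's exact code): how a
# bs_mult-style flag sets the per-process batch size.
import argparse
import torch
from torch.utils.data import DataLoader, TensorDataset

parser = argparse.ArgumentParser()
parser.add_argument('--bs_mult', type=int, default=2,
                    help='per-GPU batch size; lower to 1 to reduce memory')
parser.add_argument('--crop_size', type=int, default=1024,
                    help='training crop; smaller crops also cut activation memory')
args = parser.parse_args()

# Tiny random stand-in for the real cropped Cityscapes dataset.
side = args.crop_size // 16
images = torch.randn(16, 3, side, side)
labels = torch.randint(0, 19, (16, side, side))
train_set = TensorDataset(images, labels)

# Each process started by --nproc_per_node builds its own loader, so the
# effective global batch size is bs_mult * num_gpus (2 x 2 = 4 with defaults).
train_loader = DataLoader(train_set, batch_size=args.bs_mult, shuffle=True)
```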

@lxtGH
Hi, I did not change any files; I am still training on the Cityscapes dataset. The only change was using my two 1080Ti cards (--nproc_per_node=2) to run your sfnet_res101.sh.
Everything is normal during epoch 0, but at epoch 1 the train main loss becomes 0.000000 and validation reports NaN:
```
05-04 14:36:03.238 [epoch 0], [iter 736 / 744], [train main loss 0.982473], [lr 0.009967]
05-04 14:36:05.153 [epoch 0], [iter 737 / 744], [train main loss 0.981140], [lr 0.009967]
... (loss decreases smoothly through the end of epoch 0)
05-04 14:36:18.422 [epoch 0], [iter 744 / 744], [train main loss 0.971909], [lr 0.009967]
/home/dai/code/semantic_segmentation/23/original_train/SFSegNets-master/loss.py:111: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  return self.nll_loss(F.log_softmax(inputs), targets)
05-04 14:36:23.505 validating: 1 / 250
... (validation proceeds normally)
05-04 14:37:17.878 validating: 241 / 250
05-04 14:37:19.871 IoU:
05-04 14:37:19.872 label_id  label     iU  Precision  Recall     TP    FP    FN
05-04 14:37:19.872        0      0  37.65       1.00    0.38  37.65  0.00  1.66
/home/dai/code/semantic_segmentation/23/original_train/SFSegNets-master/utils/misc.py:275: RuntimeWarning: divide by zero encountered in float_scalars
/home/dai/code/semantic_segmentation/23/original_train/SFSegNets-master/utils/misc.py:276: RuntimeWarning: invalid value encountered in float_scalars
/home/dai/code/semantic_segmentation/23/original_train/SFSegNets-master/utils/misc.py:280: RuntimeWarning: invalid value encountered in float_scalars
05-04 14:37:19.886        1      1   0.00       0.00     nan   0.00   inf   nan
... (classes 2-18 are identical: iU 0.00, Precision 0.00, Recall nan, FP inf, FN nan)
05-04 14:37:19.888 mean 0.019816191866993904
/home/dai/anaconda3/envs/SFSegNets/lib/python3.7/site-packages/torchvision/transforms/transforms.py:210: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
05-04 14:37:21.910 [epoch 0], [val loss nan], [acc 0.37651], [acc_cls 0.05263], [mean_iu 0.01982], [fwavacc 0.14176]
05-04 14:37:21.911 best record: [val loss nan], [acc 0.37651], [acc_cls 0.05263], [mean_iu 0.01982], [fwavacc 0.14176], [epoch 0]
05-04 14:37:21.911 NaN or Inf found in input tensor.
05-04 14:37:21.928 Class Uniform Percentage: 0.5
05-04 14:37:21.934 Class Uniform items per Epoch: 2975
05-04 14:37:21.940 cls 0 len 5866
... (cls 1-18 lens: 5184, 5678, 1312, 1723, 5656, 2769, 4860, 5388, 2440, 4722, 3719, 1239, 5075, 444, 348, 188, 575, 2238)
05-04 14:37:24.731 [epoch 1], [iter 1 / 744], [train main loss 0.000000], [lr 0.009933]
05-04 14:37:26.647 [epoch 1], [iter 2 / 744], [train main loss 0.000000], [lr 0.009933]
... (train main loss stays 0.000000 for every logged iteration)
05-04 14:38:44.567 [epoch 1], [iter 43 / 744], [train main loss 0.000000], [lr 0.009933]
```
Is this gradient vanishing or exploding? Could you please tell me what the problem is?
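
For anyone debugging logs like this, a standard way to localize the first NaN with stock PyTorch tooling (this is not SFSegNets code; the tiny model and random batch below are placeholders) is anomaly detection plus a gradient scan:

```python
# Minimal NaN-localization sketch with stock PyTorch; the model and data
# are dummies standing in for sfnet_res101 and a Cityscapes batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.autograd.set_detect_anomaly(True)  # raises at the op that produced NaN

model = nn.Conv2d(3, 19, kernel_size=3, padding=1)  # dummy 19-class head
images = torch.randn(2, 3, 64, 64)
targets = torch.randint(0, 19, (2, 64, 64))

logits = model(images)
# Passing dim=1 explicitly also silences the log_softmax UserWarning
# that loss.py:111 prints in the log above.
loss = F.nll_loss(F.log_softmax(logits, dim=1), targets)
assert torch.isfinite(loss), 'loss is already NaN/Inf before backward'

loss.backward()
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print('non-finite gradient in', name)
```

If the loss really is exactly 0.000000 rather than NaN, it may also be worth checking that the epoch-1 labels are not all ignored after the class-uniform resampling shown in the log.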


lxtGH commented

What is your PyTorch version?

@lxtGH
I have torch 1.2.0, torchvision 0.4.0, and CUDA 10.0.

@lxtGH
When I use only one GPU and disable SyncBN, training seems fine, but when I use 2 GPUs I get the message above: train main loss 0.000000.

lxtGH commented

Did you use the official config from the master branch? Which loss did you use?

@lxtGH
I do not understand why one GPU without SyncBN is fine, but multiple GPUs with SyncBN give this error.
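
One way to narrow this down (a diagnostic sketch using stock PyTorch only, not the repo's own SyncBN implementation; the Sequential model is a dummy stand-in for sfnet_res101): convert the BatchNorm layers to torch-native nn.SyncBatchNorm, which exists in torch 1.2, and check whether the 2-GPU NaN persists:

```python
# Diagnostic sketch: swap BN for torch-native SyncBatchNorm before DDP.
# Run under: python -m torch.distributed.launch --nproc_per_node=2 this_file.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend='nccl')  # env:// rendezvous set up by the launcher
local_rank = dist.get_rank()             # equals the local rank on a single node
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(             # dummy stand-in for sfnet_res101
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda()
model = DDP(model, device_ids=[local_rank])
```

If the NaN disappears with the native SyncBatchNorm, the problem is more likely in the repo's SyncBN under this torch/CUDA combination than in the training recipe itself.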

@lxtGH
I downloaded the code directly from your GitHub and did not change anything.

lxtGH commented

I will try to re-train it with 2 GPUs.

@lxtGH
Hi, sorry to disturb you, but have you re-trained it with 2 GPUs yet?

@lxtGH
Hi, could you please try training with 2 GPUs? Training just to epoch 10 should be enough to see whether the problem appears.
Thank you.

lxtGH commented

I tried training with 2 GPUs; there are no bugs with that.