Training stuck at 28%
Thanks for the open-source code. When I train with the following commands:
python train.py --config yamls/adashare/nyu_v2_2task.yml --gpus 0
python re-train.py --config yamls/adashare/nyu_v2_2task.yml --gpus 0 --exp_ids 0
My terminal always prints the following and then stops:
28%|████████████████████████████████████████████████████▌ | 11/40 [01:16<03:21, 6.96s/it]
28%|██████████████████████████████████████████████████▎ | 11/40 [13:15:37<34:57:33, 4339.76s/it]
28%|██████████████████████████████████████████████████▎ | 11/40 [15:05:14<39:46:32, 4937.67s/it]
28%|███████████████████████████████████████████▍
Is it normal?
I have run into this kind of hang several times, but every time it turned out to be a machine-specific problem. My fix was to switch to a different machine and run it again. If switching machines is not feasible on your end, you could try changing the random seed or checking the data loading. My guess is that some of the data is broken and cannot be loaded.
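To make the "check the data loading" suggestion concrete, here is a minimal sketch that loads every sample once in the main process and records which indices raise an error or load unusually slowly. It is not part of this repository: `scan_dataset`, `slow_threshold_s`, and `_ToyDataset` are hypothetical names, and the sketch only assumes that the dataset object supports `len()` and integer indexing.

```python
import time


def scan_dataset(dataset, slow_threshold_s=5.0):
    """Load each sample once; return indices that fail and indices that are slow."""
    failed, slow = [], []
    for idx in range(len(dataset)):
        start = time.perf_counter()
        try:
            dataset[idx]  # triggers the same loading code the training loop uses
        except Exception as exc:
            failed.append((idx, repr(exc)))
            continue
        elapsed = time.perf_counter() - start
        if elapsed > slow_threshold_s:
            slow.append((idx, elapsed))
    return failed, slow


class _ToyDataset:
    """Hypothetical stand-in for the real dataset: index 2 is 'corrupt'."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        if idx == 2:
            raise ValueError("cannot read sample")
        return idx


failed, slow = scan_dataset(_ToyDataset())
print("failed:", failed)
print("slow:", slow)
```

Running the scan without worker processes keeps any exception visible in the main process, which may be easier to read than a stalled progress bar. A sample that blocks forever would still hang this loop, but the last index printed or logged before the stall would point at it.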
Thanks for your answer.
After training several times, I found that the code usually gets stuck at the Evaluating stage. Is there anything I can do to fix this? Everything seems fine during the training phase, but errors occur once it reaches evaluation.
##################################################
BlockDropEnv
define the scheduler (not policy learning)
define the scheduler (not policy learning)
Evaluating...
28%|█████████████████████████████████████▋ | 11/40 [01:08<03:00, 6.24s/it]-------------------------------------------------------------
seg:
update: 48, time: 73.878 mIoU: 0.170 Pixel Acc: 0.291 err: 2.905
sn:
update: 48, time: 73.880 cosine_similarity: 0.928 Angle Mean: 19.964 Angle Median: 17.824 Angle RMSE: 22.079 Angle 11.25: 10.847 Angle 22.5: 72.160 Angle 30: 87.183 Angle 45: 97.031
Change temperature from 5.00000 to 4.82500
[[0.5 0.5 ]
[0.5 0.5 ]
[0.5 0.5 ]
[0.5 0.5 ]
[0.5 0.5 ]
[0.5 0.5 ]
[0.5 0.5 ]
[0.46412945 0.53587055]]
Evaluating...
------------------------------------------------------------- | 11/40 [01:10<03:03, 6.32s/it]
seg:
update: 96, time: 75.113 mIoU: 0.289 Pixel Acc: 0.315 err: 2.858
sn:
update: 96, time: 75.115 cosine_similarity: 0.928 Angle Mean: 19.907 Angle Median: 17.975 Angle RMSE: 22.030 Angle 11.25: 12.121 Angle 22.5: 70.209 Angle 30: 87.325 Angle 45: 97.382
-------------------------------------------------------------
seg:
update: 100, time: 3.789 total: 2.689
sn:
update: 100, time: 3.792 total: 0.066
total:
update: 100, time: 3.793 total: 4.064
hamming:
update: 100, time: 3.795 total: 0.010
sparsity:
update: 100, time: 3.796 total: 1.205 task1: 0.605 task2: 0.600
28%|█████████████████████████████████████▋ | 11/40 [05:51<15:27, 31.99s/it]
28%|█████████████████████████████████████▋ | 11/40 [01:31<04:02, 8.35s/it]
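One way to find out where the evaluation stage stalls is to have Python dump the stack of every thread once a time limit has passed. The sketch below uses the standard-library `faulthandler` module for that. It is an assumption-laden illustration, not code from this repository: `run_with_hang_dump`, `_toy_evaluate`, the log file name, and the 600-second limit are all hypothetical, and the wrapped function stands in for whatever call runs the evaluation loop.

```python
import faulthandler
import os
import tempfile


def run_with_hang_dump(fn, timeout_s, log_path, *args, **kwargs):
    """Run fn; every timeout_s seconds it is still running, write all thread stacks to log_path."""
    with open(log_path, "w") as log:
        # faulthandler needs a real file descriptor, so a regular file is used here.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log)
        try:
            return fn(*args, **kwargs)
        finally:
            # Stop the watchdog before the log file is closed.
            faulthandler.cancel_dump_traceback_later()


def _toy_evaluate(n_batches):
    """Hypothetical stand-in for the evaluation loop."""
    total = 0
    for i in range(n_batches):
        total += i
    return total


log_path = os.path.join(tempfile.gettempdir(), "eval_hang_traceback.log")
result = run_with_hang_dump(_toy_evaluate, 600, log_path, 40)
print(result)
```

If the wrapped call hangs, the log file should show the line each Python thread in the main process is blocked on. It does not cover separate data-loading worker processes, so a stall inside a worker would appear only as the main process waiting on it.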
Error solved after using a different machine.
Hi, I have run into the same problem. I wonder what causes it.
Do you have any solution other than changing the machine?