A strange problem encountered during training
monk42 opened this issue · 15 comments
I redesigned the network architecture using the EESP module and trained it with the open-source training code, but I am seeing something strange. The mIoU on the training set keeps increasing, while the mIoU on the validation set fluctuates like a random value. By the end of training it stops changing and stays very small!
Did you process the Cityscapes dataset correctly? This seems like an issue with the labels.
Yes, I did! 255 is converted to 19.
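For readers comparing numbers, here is a minimal sketch of how mIoU is commonly computed from a confusion matrix. It is not taken from the repository's evaluation code; the function name `mean_iou` and the choice to skip classes that never occur are assumptions made for illustration.

```python
def mean_iou(preds, targets, num_classes):
    """Mean intersection-over-union over flat sequences of class indices.

    Assumes every prediction lies in [0, num_classes). Pixels whose target
    falls outside that range are skipped.
    """
    # confusion[t][p] counts pixels with ground truth t predicted as p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(preds, targets):
        if 0 <= t < num_classes:
            confusion[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(num_classes)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and target
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Running the same function on the training and validation predictions is one way to rule out a difference in how the two metrics are computed.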
I mean Cityscapes labels are not continuous between 0 and 19. You need to first convert them to continuous values. Have you done this?
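As a rough illustration of that conversion, the sketch below remaps raw Cityscapes IDs to contiguous train IDs. The ID pairs follow the label table in cityscapesScripts; sending every other ID to class 19 mirrors the "255 converted to 19" step mentioned in this thread and is an assumption about the setup, as is the helper name `remap_labels`.

```python
# Raw Cityscapes id -> trainId for the 19 evaluated classes.
ID_TO_TRAIN_ID = {
    7: 0, 8: 1, 11: 2, 12: 3, 13: 4, 17: 5, 19: 6, 20: 7, 21: 8, 22: 9,
    23: 10, 24: 11, 25: 12, 26: 13, 27: 14, 28: 15, 31: 16, 32: 17, 33: 18,
}
# Assumption from this thread: ignored pixels (255) are folded into class 19.
IGNORE_TRAIN_ID = 19


def remap_labels(raw_ids):
    """Map a flat sequence of raw ids to contiguous train ids in [0, 19]."""
    return [ID_TO_TRAIN_ID.get(i, IGNORE_TRAIN_ID) for i in raw_ids]


print(remap_labels([7, 26, 0, 33, 3]))  # [0, 13, 19, 18, 19]
```

On real label images the same lookup would normally be applied to the whole array at once. The list version here only keeps the example dependency-free.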
Yes, I used this script to convert them: https://github.com/mcordts/cityscapesScripts/blob/master/cityscapesscripts/preparation/createTrainIdLabelImgs.py
Could you cross-verify that the label images for the validation set are correct?
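One quick way to cross-verify is to count the distinct values in each label image and flag anything outside the expected class range. The sketch below works on a flat sequence of pixel values; loading the PNG and flattening it is left out, and the name `label_report` and the default of 20 classes are assumptions for illustration.

```python
from collections import Counter


def label_report(pixels, num_classes=20):
    """Return per-value pixel counts and the sorted values outside [0, num_classes)."""
    counts = Counter(pixels)
    unexpected = sorted(v for v in counts if not 0 <= v < num_classes)
    return counts, unexpected


counts, unexpected = label_report([0, 0, 13, 19, 255, 7])
print(unexpected)  # [255]
```

An empty `unexpected` list for every validation image, together with counts that are not dominated by a single class, would suggest the labels were converted as intended.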
Which PyTorch version are you using?
Could you please also check whether you see the same behavior when you use the training set as your validation set?
PyTorch version 1.0.1.post2. I will try your suggestions, thanks!
I see the same behavior when I use the training set as my validation set,
and my validation set is correct. Throughout the code, the only thing I changed is that training now uses one loss value instead of two.
Could you point me to your repo?
output1, output2 = model(input)
# set the grad to zero
optimizer.zero_grad()
loss1 = criterion(output1, target)
loss2 = criterion(output2, target)
loss = loss1 + loss2
changed to:
output1 = model(input)
optimizer.zero_grad()
loss = criterion(output1, target)
Nothing else was changed.
Sorry, I haven't created a repo.
Well, you need to change the model file too, so that you get only one output instead of two.
I used the EESP module from your code to redesign the network structure so that it has only one output, then modified the training code as I described. The other parts have not changed, and the dataset is fine. As you can see, the network does train, but validation goes wrong. When I use the same code to train other network structures, everything works! I don't know what is causing it.
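If the training loop has to work with both the original two-output model and a single-output variant, one option is a small helper that accepts either form. This is a sketch rather than code from the repository: `total_loss` is a hypothetical name, and the demo uses a stand-in criterion on plain numbers instead of a real PyTorch loss.

```python
def total_loss(outputs, target, criterion):
    """Sum the criterion over every output the model returned."""
    if not isinstance(outputs, (tuple, list)):
        outputs = (outputs,)  # single-output model: wrap it so the loop still works
    return sum(criterion(out, target) for out in outputs)


# Stand-in criterion for the demo; a real loop would pass the loss module instead.
def abs_error(out, tgt):
    return abs(out - tgt)


print(total_loss((1.0, 3.0), 2.0, abs_error))  # two outputs -> 2.0
print(total_loss(1.5, 2.0, abs_error))         # one output  -> 0.5
```

The same helper can be called from the validation loop, which keeps the training and validation paths from silently handling the outputs differently.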
Could you use the original code (without your changes) and see if it works?
If that works, then try incorporating your changes one by one. This will help you to debug the error.
Thanks, I will do it!