sacmehta/ESPNetv2

A strange problem encountered during training

monk42 opened this issue · 15 comments

I redesigned the network architecture using the EESP module and trained it with the open-source training code, but I see a strange phenomenon. The mIoU on the training set keeps increasing, but the mIoU on the validation set looks like a random value; by the end of training it stays unchanged and is very small!

Did you process Cityscapes dataset correctly? Seems like an issue with labels.

Yes, I did! 255 is converted to 19.

After 6 epochs!

I mean Cityscapes labels are not continuous between 0 and 19. You need to first convert them to continuous values. Have you done this?
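For reference, the remapping described above can be sketched with a NumPy lookup table. The 19 label IDs are the evaluation classes from the official cityscapesScripts label definitions; the helper names `build_lookup` and `remap_labels` are hypothetical, and mapping every other ID to 19 is an assumption that follows the "255 converted to 19" convention mentioned in this thread.

```python
import numpy as np

# Cityscapes label IDs of the 19 evaluation classes, in train-ID order
# (road, sidewalk, building, ..., bicycle), per cityscapesScripts.
EVAL_LABEL_IDS = [7, 8, 11, 12, 13, 17, 19, 20, 21, 22,
                  23, 24, 25, 26, 27, 28, 31, 32, 33]
IGNORE_INDEX = 19  # assumption: ignore class used in this thread


def build_lookup(ignore_index=IGNORE_INDEX):
    """Return a 256-entry table mapping raw label IDs to continuous IDs."""
    lut = np.full(256, ignore_index, dtype=np.uint8)
    for train_id, label_id in enumerate(EVAL_LABEL_IDS):
        lut[label_id] = train_id
    return lut


def remap_labels(label_img, lut=None):
    """Map a uint8 label image with raw IDs to IDs in 0..18 plus ignore."""
    lut = build_lookup() if lut is None else lut
    return lut[label_img]
```

Applied to a raw `*_labelIds.png` array, this should leave only values in 0..19; anything else in the result suggests the labels were not converted.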

Could you cross-verify that the label images for the validation set are correct?
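One quick way to do this check is to list the label values that fall outside the expected range. This is a minimal sketch: `find_invalid_labels` is a hypothetical helper, and the defaults (19 classes, ignore index 19) are assumptions taken from the convention discussed in this thread.

```python
import numpy as np


def find_invalid_labels(label_img, num_classes=19, ignore_index=19):
    """Return the sorted label values that are neither a class nor ignore."""
    allowed = set(range(num_classes)) | {ignore_index}
    return [int(v) for v in np.unique(label_img) if int(v) not in allowed]
```

Running it over every validation label image and printing any non-empty result shows whether raw Cityscapes IDs (for example 33 or 255) are still present.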

Which PyTorch version are you using?

Could you please also check whether you see the same behavior when you use the training set as your validation set?

PyTorch version 1.0.1.post2. I will try your suggestions, thanks!

The observation is the same when I use the training set as my validation set, and my validation set is correct. Throughout the code, the only thing I changed was reducing the two loss values used for training to a single loss value.

Could you point me to your repo?

output1, output2 = model(input)
# set the grad to zero
optimizer.zero_grad()
loss1 = criterion(output1, target)
loss2 = criterion(output2, target)
loss = loss1 + loss2

changed to

output1 = model(input)
optimizer.zero_grad()
loss = criterion(output1, target)
Nothing else was changed.
Sorry, I didn't create a repo.

Well, you need to change the model file too, so that you get only one output instead of two.
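A loss computation that tolerates both cases makes a mismatch between the model and the training loop easier to spot. The following is only a sketch: `total_loss` is a hypothetical helper, not part of the repository, and it assumes the model returns either a single output or a tuple/list of outputs.

```python
def total_loss(model, criterion, inputs, target):
    """Sum the criterion over every output head the model returns."""
    outputs = model(inputs)
    # Wrap a single output so both cases go through the same code path.
    if not isinstance(outputs, (tuple, list)):
        outputs = (outputs,)
    return sum(criterion(out, target) for out in outputs)
```

The same unpacking would have to be mirrored in the validation loop, since a loop that still expects two outputs would otherwise receive something different from what it scores.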

I used the EESP module from your code to redesign the network structure so that it has only one output, then modified the training code as I described. The other parts have not changed, and the dataset is fine. As you can see, the network trains, but validation has a problem. When I used this same code to train other network structures, everything was fine! I don't know what causes it.

Could you use the original code (without your changes) and see if it works?

If that works, then try incorporating your changes one by one. This will help you to debug the error.

Thanks, I will do it!