Eromera/erfnet_pytorch

transfer learning with erfnet

ytzhao opened this issue · 8 comments

Hi all,

I would like to do a transfer learning project using ERFNet, but I have some questions about the training process.

I have collected my own data (training : val : test = 7k : 1.5k : 1.5k images), and the dataset has 15 classes. If I want to train the model without pre-trained ImageNet weights, how do I decide when to terminate the encoder training process?

Thank you very much :)

Hi, your project sounds good. You should train the encoder until the loss converges (until it stops decreasing), paying special attention to the gap between train and val accuracy so that it does not grow (a sign of overfitting). You can also try training the full network without pretraining the encoder, by using a larger LR at the beginning or by training for more epochs.
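As a rough illustration of the stopping rule described above, here is a minimal sketch in plain Python. The function names, the `patience` and `min_delta` parameters, and their default values are assumptions made for this example; they are not part of the erfnet_pytorch training script.

```python
def should_stop(val_losses, patience=10, min_delta=1e-3):
    """Return True when the validation loss has stopped decreasing.

    val_losses: one validation loss per finished epoch, oldest first.
    patience / min_delta: hypothetical knobs, tune them for your data.
    """
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])  # best loss before the window
    best_recent = min(val_losses[-patience:])  # best loss inside the window
    # Stop if the last `patience` epochs did not improve by at least min_delta.
    return best_recent > best_before - min_delta


def gap_is_widening(train_accs, val_accs, window=5):
    """Return True when the train/val accuracy gap is larger now than it was
    `window` epochs ago, which hints at overfitting."""
    if len(train_accs) <= window or len(val_accs) <= window:
        return False
    gap_now = train_accs[-1] - val_accs[-1]
    gap_then = train_accs[-1 - window] - val_accs[-1 - window]
    return gap_now > gap_then
```

Calling both functions at the end of every epoch with the logged values gives a simple signal for when to end the encoder stage and move on to the decoder.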

@Eromera Hi Eromera, thanks for your reply. I've tried your project with my own collected data, but I ran into some problems when training the model.
[image]

Encoder training works well, but decoder training always hits the same problem. I guess the batch size may be causing it. Therefore, after finishing the encoder part, I trained the decoder separately. It converges very slowly: for encoder training, the val IoU reaches about 50% at epoch 150, but for the decoder, the val IoU is only about 7% at epoch 150.

BTW, I don't know what the "step" ("tra_loss : 2.966 (epoch: 1, step: 0)") means and how it works in the training process. Could you explain more? Thank you :)

Hi @ytzhao, the "out of memory" problem is normally caused by the batch not fitting in your GPU memory. Have you tried smaller batch sizes?

It does not make sense that the decoder only reaches 7%. Can you provide more info? Are your labels OK? Are there any classes that score much better than others?

A step is one forward+backward pass on the network with one batch, so steps * batch_size = dataset size (1 full epoch). At the start of each epoch, the info for step=0 is shown for logging purposes.
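To make the step arithmetic concrete, here is a small sketch. The batch size of 6 is only an assumed value for illustration, and whether a final partial batch is dropped depends on how the data loader is configured, so both cases are shown. When the dataset size is not an exact multiple of the batch size, the equality above holds only approximately.

```python
import math


def steps_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of forward+backward passes needed to see the dataset once."""
    if drop_last:
        # An incomplete final batch is skipped.
        return dataset_size // batch_size
    # An incomplete final batch still counts as one step.
    return math.ceil(dataset_size / batch_size)


# 7k training images as in the question; batch size 6 is an assumption.
print(steps_per_epoch(7000, 6))                  # 1167
print(steps_per_epoch(7000, 6, drop_last=True))  # 1166
```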

Hi @Eromera, thanks for your reply.
For training the decoder separately, I don't want to use the pretrained ImageNet model. After training the encoder part, I have the best encoder checkpoint file (model_best_enc.pth.tar), and I would like to train the decoder starting from this checkpoint. I modified the decoder loading part of your code, as shown below:

[image: snapshot18]

I'm not sure this is a good way to continue training the decoder.

Hi @ytzhao, to train the decoder after training the encoder you shouldn't need to change the code; you just need to use the two flags --decoder and --state, with --state pointing to "model_best_enc.pth.tar". You can check this post where this was mentioned.

Hi @Eromera, when I train the decoder, an error always occurs at loss.backward(). I have searched for it on the internet, but I am training on a single GPU, so it is probably not a multi-GPU issue. The error is RuntimeError: CUDNN_STATUS_INTERNAL_ERROR

Can you tell me how to handle this?
[image: tim 20181005220352]

@Chenfeng1271 what's your CUDA version and PyTorch version?

Hi @Chenfeng1271, I think that error is related to a mismatch between the number of classes in the network output and the class IDs present in the labels.
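One way to test that hypothesis before training is to scan the label images for class IDs outside the range the network can predict. The sketch below works on a flat sequence of integer label values; the function name, the `ignore_index` parameter, and the example value 255 are assumptions for illustration and are not taken from the repository.

```python
from collections import Counter


def check_label_ids(label_ids, num_classes, ignore_index=None):
    """Find label values the network cannot predict.

    label_ids: flat iterable of integer class IDs (e.g. a flattened label image).
    num_classes: number of output channels of the network.
    ignore_index: hypothetical "void" ID that the loss is configured to skip.
    Returns (sorted list of invalid IDs, Counter of all IDs seen).
    """
    counts = Counter(label_ids)
    invalid = sorted(
        v for v in counts
        if v != ignore_index and not (0 <= v < num_classes)
    )
    return invalid, counts


# Toy example: 15 classes means valid IDs are 0..14, so 15 is out of range.
invalid, counts = check_label_ids([0, 1, 14, 15, 255], num_classes=15, ignore_index=255)
print(invalid)  # [15]
```

The returned counts also show how often each class appears, which helps with the earlier question of whether some classes are far better represented than others.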