Transfer learning with ERFNet
ytzhao opened this issue · 8 comments
Hi all,
I would like to do a transfer learning project using ERFNet, but I have some questions about the training process.
I have collected my own data (training : val : test = 7k : 1.5k : 1.5k images) and the dataset has 15 classes. If I train the model without pre-trained ImageNet weights, how do I decide when to terminate the encoder training process?
Thank you very much :)
Hi, your project sounds good. You should train the encoder until the loss converges (until it stops decreasing), especially paying attention to the gap between train and val accuracy so that it does not get larger (overfitting). You can also try training the full network without pretraining the encoder, by using a larger LR at the beginning or by training for more epochs.
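For deciding when the loss has converged, a simple patience-based rule works well in practice. A minimal sketch, assuming you log per-epoch validation losses (the function name and thresholds are illustrative, not part of the ERFNet training script):

```python
def should_stop(val_losses, patience=10, min_delta=1e-3):
    """Stop once the val loss has not improved by min_delta for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_recent = min(val_losses[-patience:])
    best_before = min(val_losses[:-patience])
    return best_before - best_recent < min_delta

# Example: the loss plateaus at 0.89, so after 5 stagnant epochs we stop
losses = [2.0, 1.5, 1.2, 1.0, 0.9, 0.89, 0.89, 0.89, 0.89, 0.89, 0.89]
print(should_stop(losses, patience=5))  # True
```

Watching the train/val gap as Eromera suggests is complementary: if val loss rises while train loss keeps falling, stop even earlier.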
@Eromera Hi Eromera, thanks for your reply. I've tried your project with my own collected data, but I ran into some problems during training.
Encoder training works well, but the decoder always hits the same problem. I guess the batch size may be the cause, so after finishing the encoder part I train the decoder separately. It converges very slowly: for the encoder, the val IoU reaches about 50% at epoch 150, but for the decoder it only reaches about 7% at epoch 150.
BTW, I don't know what the "step" in "tra_loss : 2.966 (epoch: 1, step: 0)" means or how it works in the training process. Could you explain more? Thank you :)
Hi @ytzhao, the "out of memory" problem is normally caused by the batch not fitting in your GPU memory. Have you tried smaller batch sizes?
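If a smaller batch fits in memory but hurts training, gradient accumulation keeps the original effective batch size while stepping with the smaller one. A minimal sketch with a toy model (all names, sizes, and the accumulation factor below are illustrative, not from the ERFNet code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(8, 15)                 # toy stand-in for the real network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

ds = TensorDataset(torch.randn(24, 8), torch.randint(0, 15, (24,)))
loader = DataLoader(ds, batch_size=2)    # e.g. batch_size=12 ran out of memory
accum = 6                                # 2 * 6 = effective batch of 12

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum  # scale so accumulated grads average
    loss.backward()
    if (i + 1) % accum == 0:               # step once per 6 small batches
        optimizer.step()
        optimizer.zero_grad()
```

The loss is divided by `accum` so that the summed gradients match what one large batch would have produced.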
It does not make sense that the decoder only reaches 7%. Can you provide more info? Are your labels OK? Are there any classes that score much better than others?
A step is one forward+backward pass on the network with a batch, so steps × batch_size = dataset size (one full epoch). At the start of each epoch, the info for step=0 is shown for logging purposes.
Hi @Eromera, Thanks for your reply.
For training the decoder separately, I don't want to use the pretrained ImageNet model. After training the encoder, I get the best encoder checkpoint (model_best_enc.pth.tar), so I would like to resume decoder training from that checkpoint. I modified the decoder-loading part of your code accordingly.
I'm not sure this is a good way to continue training the decoder.
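One common pattern for this is to load the encoder-only checkpoint into the full network with `strict=False`, so the decoder keeps its fresh initialization. A sketch with a tiny stand-in model; the encoder/decoder split and the checkpoint key names are assumptions, not the actual ERFNet code:

```python
import torch
import torch.nn as nn

# Toy stand-in with the same encoder/decoder split as ERFNet's Net
class Net(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, padding=1)
        self.decoder = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Net(num_classes=15)

# Simulates torch.load("model_best_enc.pth.tar"): only encoder weights saved.
enc_ckpt = {"state_dict": {"encoder.weight": torch.randn(16, 3, 3, 3),
                           "encoder.bias": torch.randn(16)}}

# strict=False skips the decoder keys missing from the encoder-only checkpoint
missing, unexpected = model.load_state_dict(enc_ckpt["state_dict"], strict=False)

# Optionally freeze the encoder while the decoder warms up
for p in model.encoder.parameters():
    p.requires_grad = False
```

Checking the returned `missing`/`unexpected` key lists is a good way to verify that only the decoder weights were left uninitialized.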
Hi @Eromera, when I train the decoder, an error always occurs at loss.backward(). I have searched for it on the internet; since I train on a single GPU, it's probably not a multi-GPU issue. The error is RuntimeError: CUDNN_STATUS_INTERNAL_ERROR.
@Chenfeng1271 what's your CUDA version and PyTorch version?
Hi @Chenfeng1271, I think that error is related to a mismatch between the number of classes the network outputs and the number of classes in your labels.
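A quick sanity check for that mismatch: every label index must fall in [0, num_classes - 1] (apart from an ignore index), otherwise cross-entropy-style losses can crash with opaque CUDA/cuDNN errors. A sketch with made-up label values:

```python
import torch

num_classes = 15
labels = torch.tensor([[0, 3, 14], [7, 2, 255]])  # 255 = common "ignore" value

valid = labels[labels != 255]   # drop the ignore index, if you use one
ok = bool((valid >= 0).all() and (valid < num_classes).all())
print(ok)  # True here; False would mean labels exceed the network's classes
```

Running this once over the whole label set before training is much cheaper than debugging a crash inside loss.backward().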