Why only update on train_op2 but not train_op?
snownus opened this issue · 4 comments
Why is only train_op2 updated in the code below? I have tried this version, and it doesn't seem to work. @sseung0703
KD_methods_with_TF/train_w_distill.py
Line 132 in 89d2b1f
train_op2 is for initializing the student network (or some auxiliary network) for distilling the knowledge.
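Roughly, the idea looks like the TF1-style sketch below; the function and variable names (`build_train_ops`, `class_loss`, `hint_loss`) are placeholders for illustration, not the exact names used in train_w_distill.py:

```python
import tensorflow as tf

def build_train_ops(class_loss, hint_loss, learning_rate):
    """Sketch of the two-train-op pattern used for distillation.

    class_loss: student cross-entropy on the labels (main training phase).
    hint_loss:  e.g. FitNet MSE between teacher and student feature maps
                (initialization phase).
    """
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9,
                                           use_nesterov=True)

    # train_op: updates the student with the classification loss only.
    train_op = optimizer.minimize(class_loss)

    # train_op2: updates the student (and any auxiliary regressor/connector)
    # with the distillation loss only; this is the op run during the
    # initialization epochs.
    train_op2 = optimizer.minimize(hint_loss)
    return train_op, train_op2
```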
@sseung0703 Hello, thanks for your work!
Could you please clarify how the distillation methods in your code work? For example, let's look at the FitNet approach (it computes gradients with Optimizer_w_Initializer and returns train_op and train_op2). Am I right that the distillation loss alone is used during the initialization epochs (40 in your implementation), and after that the distillation loss is turned off and training is based only on the classification loss?
Yes, exactly right. This process is similar to pre-training and fine-tuning.
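So the schedule is roughly like this sketch (the epoch count, step count, and names here are illustrative, not copied from the repo):

```python
def run_two_phase_training(sess, train_op, train_op2, next_batch, x, y,
                           init_epochs=40, total_epochs=100, steps_per_epoch=500):
    """Sketch of the schedule: distillation-only 'pre-training' phase,
    followed by classification-only 'fine-tuning'."""
    for epoch in range(total_epochs):
        for _ in range(steps_per_epoch):
            images, labels = next_batch()
            if epoch < init_epochs:
                # Phase 1: minimize only the distillation (hint) loss.
                sess.run(train_op2, feed_dict={x: images, y: labels})
            else:
                # Phase 2: distillation is switched off; minimize only the
                # classification loss.
                sess.run(train_op, feed_dict={x: images, y: labels})
```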
@sseung0703 Ok, thank you! But if so, I have one more question.
As far as I understand, FitNet-like distillation puts an MSE loss on hidden feature maps, not on the output classification layer. So, in your setting, during pre-training you train all layers except the last fully connected one, and during the fine-tuning stage the whole network is trained.
In your code I can see that the fine-tuning stage starts with the same learning rate (0.1) as pre-training, so given that the classification layer was not pre-trained, it seems to me that the pre-trained layers would suffer and mostly lose their good initialization.
Maybe it would be better to also pre-train the classification layer with the MSE loss? Or to start fine-tuning with the pre-trained layers frozen (a rough sketch of the latter is below)?
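For instance, freezing could be done by passing only the classifier variables to the optimizer, something like this TF1-style sketch (the scope name and function name are hypothetical, not from the repo):

```python
import tensorflow as tf

def classifier_only_train_op(class_loss, learning_rate, fc_scope='Student/fc'):
    """Sketch: restrict updates to the classifier so the FitNet-initialized
    layers keep their pre-trained weights. fc_scope is hypothetical and would
    need to match the actual variable scope of the last layer."""
    fc_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=fc_scope)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    return optimizer.minimize(class_loss, var_list=fc_vars)
```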
Or maybe I'm missing some details; if so, I would appreciate it if you could point me to them :)
Thank you!