Have you tried a configuration where the teacher and student are totally different, such as DenseNet and ResNet, or DenseNet and MobileNet?
leoozy opened this issue
I didn't try it with DenseNet.
DenseNet has long skip connections from low-level features, and I think these long skip connections are hard to represent with a pure ResNet.
So it is a very difficult case for feature distillation.
But I can give you some advice.
If the teacher and student are totally different, decrease the ratio of the distillation loss.
The most extreme case I've tried is ResNet50-V1D (teacher) to MobileNetV2 (student).
In this case, I achieved the best performance with a x0.01 ratio for the distillation loss.
In other words, change the line that combines the CE and distillation losses to something like:

```python
loss = loss_CE + loss_distill.sum() / batch_size / 10000 / 100  # trailing "/ 100" applies a x0.01 distillation ratio
```
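If you prefer to expose the ratio as a hyperparameter instead of hard-coding an extra divisor, a minimal sketch is below. `combine_losses` and `distill_ratio` are names I made up for illustration; `loss_CE`, `loss_distill`, and `batch_size` are assumed to be the same values as in the line above.

```python
import torch

def combine_losses(loss_CE: torch.Tensor,
                   loss_distill: torch.Tensor,
                   batch_size: int,
                   distill_ratio: float = 0.01) -> torch.Tensor:
    # distill_ratio=1.0 reproduces the original line;
    # 0.01 is equivalent to the extra "/ 100" above and worked best
    # for the ResNet50-V1D -> MobileNetV2 pair.
    return loss_CE + distill_ratio * loss_distill.sum() / batch_size / 10000

# usage inside the training loop:
# loss = combine_losses(loss_CE, loss_distill, batch_size, distill_ratio=0.01)
```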
Here is a performance table for different distillation ratios.
Distill ratio | Top-1 error (%) |
---|---|
x0 (w/o distillation) | 28.13 |
x1 | 30.22 |
x0.1 | 28.35 |
x0.01 | 27.52 |
x0.001 | 27.72 |
So, if distillation doesn't work for your configuration and the teacher and student have different architectures, try a lower ratio for the distillation loss.
This is not an elegant solution, but it works well in many cases.
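If you want to reproduce this kind of ratio sweep for your own teacher-student pair, a rough sketch is below; the script name and flag are assumptions about your own training setup, not this repo's actual command line.

```python
# Hypothetical sweep over the distillation ratios compared in the table above.
# "train.py" and "--distill-ratio" are placeholders; adapt them to your setup.
import subprocess

for ratio in [1.0, 0.1, 0.01, 0.001]:
    subprocess.run(
        ["python", "train.py", "--distill-ratio", str(ratio)],
        check=True,
    )
```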