clovaai/overhaul-distillation

Have you tried a configuration where the teacher and student are totally different, such as DenseNet and ResNet, or DenseNet and MobileNet?

leoozy opened this issue · 1 comment

bhheo commented

I haven't tried it with DenseNet.
DenseNet has long skip connections from low-level features, and I think these long skip connections are hard to represent with a plain ResNet.
So it is a very difficult case for feature distillation.
But I can give you some advice:
if the teacher and student are totally different, decrease the ratio of the distillation loss.

The most extreme case I've tried is ResNet50-V1D (teacher) to MobileNetV2 (student).
In that case, I achieved the best performance with a x0.01 ratio of distillation loss.
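As a rough sketch of that teacher-student setup (torchvision's plain resnet50 is used as a stand-in here, since ResNet50-V1D is a GluonCV variant and not in torchvision; the feature-distillation wrapper from this repo is omitted):

import torchvision.models as models

# torchvision stand-in: plain resnet50 approximates ResNet50-V1D
teacher = models.resnet50(pretrained=True).eval()
student = models.mobilenet_v2(pretrained=False)

# freeze the teacher; only the student receives gradient updates
for p in teacher.parameters():
    p.requires_grad = False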

Concretely, to apply that ratio, change this line

loss = loss_CE + loss_distill.sum() / batch_size / 10000

to something like

loss = loss_CE + loss_distill.sum() / batch_size / 10000 / 100
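Equivalently, the extra divisor can be exposed as a tunable weight instead of being hard-coded. A minimal sketch, assuming the loss variables from this repo's training loop; the name distill_ratio is my own, not an existing option in the codebase:

# distill_ratio is a hypothetical hyperparameter: 1.0 reproduces the
# default setting, 0.01 the heterogeneous teacher-student case above
distill_ratio = 0.01
loss = loss_CE + loss_distill.sum() / batch_size / 10000 * distill_ratio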

Here is a table of performance according to the distillation ratio.

Distill ratio       Error (top-1)
x0 (w/o distill)    28.13
x1                  30.22
x0.1                28.35
x0.01               27.52
x0.001              27.72
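For a quick programmatic check, the same numbers can be scanned for the minimum (the values below are just the table above transcribed):

# top-1 error by distillation ratio, transcribed from the table above
top1_error = {0.0: 28.13, 1.0: 30.22, 0.1: 28.35, 0.01: 27.52, 0.001: 27.72}
best = min(top1_error, key=top1_error.get)
print(f'best ratio: x{best} -> top-1 error {top1_error[best]:.2f}')
# prints: best ratio: x0.01 -> top-1 error 27.52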

So, if distillation doesn't work for your configuration and the teacher and student have different architectures, try a lower ratio for the distillation loss.
This is not a good solution, but it works well in many cases.