clovaai/overhaul-distillation

Have you tried a configuration where the teacher and student are totally different, such as DenseNet and ResNet, or DenseNet and MobileNet?

leoozy opened this issue · 1 comment

bhheo commented

I haven't tried it with DenseNet.
DenseNet has long skip connections from low-level features, and I think these long skip connections are hard to represent with a plain ResNet.
So it is a very difficult case for feature distillation.
But I can give you some advice:
if the teacher and student are totally different, decrease the ratio of the distillation loss.

The most extreme case I've tried is ResNet50-V1D (teacher) to MobileNetV2 (student).
In that case, I achieved the best performance with a x0.01 ratio of distillation loss.
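As a rough sketch of that teacher-student setup (torchvision's plain resnet50 is used as a stand-in here, since ResNet50-V1D is a GluonCV variant and not in torchvision; the feature-distillation wrapper from this repo is omitted):

import torchvision.models as models

# torchvision stand-in: plain resnet50 approximates ResNet50-V1D
teacher = models.resnet50(pretrained=True).eval()
student = models.mobilenet_v2(pretrained=False)

# freeze the teacher; only the student receives gradient updates
for p in teacher.parameters():
    p.requires_grad = False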

Concretely, to apply that ratio, change this line

loss = loss_CE + loss_distill.sum() / batch_size / 10000

to something like

loss = loss_CE + loss_distill.sum() / batch_size / 10000 / 100
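Equivalently, the extra divisor can be exposed as a tunable weight instead of being hard-coded. A minimal sketch, assuming the loss variables from this repo's training loop; the name distill_ratio is my own, not an existing option in the codebase:

# distill_ratio is a hypothetical hyperparameter: 1.0 reproduces the
# default setting, 0.01 the heterogeneous teacher-student case above
distill_ratio = 0.01
loss = loss_CE + loss_distill.sum() / batch_size / 10000 * distill_ratio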

Here is a table of performance according to the distillation ratio.

Distill ratio       Error (top-1)
x0 (w/o distill)    28.13
x1                  30.22
x0.1                28.35
x0.01               27.52
x0.001              27.72
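For a quick programmatic check, the same numbers can be scanned for the minimum (the values below are just the table above transcribed):

# top-1 error by distillation ratio, transcribed from the table above
top1_error = {0.0: 28.13, 1.0: 30.22, 0.1: 28.35, 0.01: 27.52, 0.001: 27.72}
best = min(top1_error, key=top1_error.get)
print(f'best ratio: x{best} -> top-1 error {top1_error[best]:.2f}')
# prints: best ratio: x0.01 -> top-1 error 27.52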

So, if distillation doesn't work for your configuration and the teacher and student have different architectures, try a lower ratio for the distillation loss.
This is not a good solution, but it works well in many cases.