KD with ResNet and MobileNetV3
hamhanry opened this issue · 4 comments
Hi,
Thanks for the wonderful research work.
Currently, I would like to use MobileNetV3 as the student network and ResNet as the teacher network. For your implementation of def get_bn_before_relu(self), would it be the same to take the BN before h-swish in MobileNetV3?
And for feature extraction, would the pre-ReLU part likewise become pre-h-swish in a MobileNetV3 implementation?
Thank you.
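For concreteness, here is a toy Conv-BN-h-swish block (illustration only, not code from the repository) showing what I mean by the pre-activation feature: the BN output just before h-swish, analogous to the pre-ReLU feature in MobileNetV2.

import torch
import torch.nn as nn

# Toy MobileNetV3-style sub-block: Conv -> BN -> h-swish
block = nn.Sequential(
    nn.Conv2d(16, 64, kernel_size=1, bias=False),
    nn.BatchNorm2d(64),
    nn.Hardswish(),  # MobileNetV3 uses h-swish where MobileNetV2 uses ReLU6
)

bn_before_hswish = block[1]  # the BN that sits directly in front of the activation

# The pre-activation feature is the output of that BN, before Hardswish is applied
x = torch.randn(2, 16, 8, 8)
pre_activation = bn_before_hswish(block[0](x))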
Hi,
Thank you for your interest in my work.
I have never tried KD on MobileNetV3, so I'm not 100% sure.
But if you use MobileNetV3 as the student, I think the current implementation will be fine.
In my work, the activation, margin, and pre-ReLU features only concern the teacher network.
For the student, the connector layers produce pre-ReLU-like features based on BN.
So you don't need to worry about the details of the student's representation.
You can find the connector layer here:
overhaul-distillation/ImageNet/distiller.py, lines 15 to 27 (commit 76344a8)
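Roughly, the connector is a 1x1 convolution followed by BN that maps the student feature to the teacher's channel width. A minimal sketch of the idea (paraphrased here, not copied verbatim from distiller.py):

import torch.nn as nn

def build_feature_connector(t_channel, s_channel):
    # 1x1 conv to match the teacher's channel width, followed by BN,
    # so the connector output plays the role of a pre-ReLU (BN) feature
    # on the student side.
    connector = [
        nn.Conv2d(s_channel, t_channel, kernel_size=1, stride=1, padding=0, bias=False),
        nn.BatchNorm2d(t_channel),
    ]
    return nn.Sequential(*connector)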
In short, if you only change the student, I think there will be no problem.
However, performance improvement depends on many factors and training techniques.
So I'm not sure it will improve performance on MobileNetV3, since that model already adopts many recent regularization and training techniques.
Dear @bhheo,
Yes, sure. Thanks for the reply.
I found some mistakes during the integration with MobileNetV3.
I am also curious about one more thing:
in your get_bn_before_relu, for the fourth feature (bn4), why did you take a BN that comes after the ReLU?
def get_bn_before_relu(self):
    bn1 = self.features[4].conv[1]
    bn2 = self.features[7].conv[1]
    bn3 = self.features[14].conv[1]
    bn4 = self.features[-1].conv[-1]
    return [bn1, bn2, bn3, bn4]
If you take a look at bn4, it is not a BN before a ReLU: features[-1].conv[-1] resolves to the BatchNorm2d at index (7) in the module printed below, which has no ReLU after it. Do you have any explanation for that?
Thank you.
(17): InvertedResidual(
  (conv): Sequential(
    (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU6()
    (3): Conv2d(960, 960, kernel_size=(3, 3), stride=(1, 1), groups=960, bias=False)
    (4): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU6()
    (6): Conv2d(960, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (7): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
Dear @hamhanry,
That is my mistake.
I intended to use the BN at the final conv layer of the ImageNet version
(such as https://github.com/tonylins/pytorch-mobilenet-v2/blob/99f213657e97de463c11c9e0eaca3bda598e8b3f/MobileNetV2.py#L99).
However, that layer was removed in the semantic segmentation version, so features[-1].conv[-1] points to a different BN with no ReLU after it.
Since MobileNetV2 was never used as a teacher in my paper, I didn't pay attention to that function and made a mistake.
Sorry for the confusion.
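If you need bn4 to point at a BN that is actually followed by an activation in this segmentation variant, one possible workaround (a suggestion based on the module printed above, not a verified fix) is to take the depthwise BN at index (4), which sits right before the ReLU6 at index (5):

def get_bn_before_relu(self):
    bn1 = self.features[4].conv[1]
    bn2 = self.features[7].conv[1]
    bn3 = self.features[14].conv[1]
    # features[-1].conv[-1] is the projection BN (index 7) with no ReLU after it;
    # conv[4] is the depthwise BN immediately followed by ReLU6 at index 5.
    bn4 = self.features[-1].conv[4]
    return [bn1, bn2, bn3, bn4]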