KD with ResNet and MobileNetV3
hamhanry opened this issue · 4 comments
Hi,
Thanks for the wonderful research work.
Currently, I would like to use MobileNetV3 as the student network and ResNet as the teacher network. For your implementation of def get_bn_before_relu(self), would it be the same to take the BN before h-swish in MobileNetV3?
And for feature extraction, would the pre-ReLU part likewise become pre-h-swish in a MobileNetV3 implementation?
Thank you.
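For concreteness, here is a toy Conv-BN-h-swish block (illustration only, not code from the repository) showing what I mean by the pre-activation feature: the BN output just before h-swish, analogous to the pre-ReLU feature in MobileNetV2.

import torch
import torch.nn as nn

# Toy MobileNetV3-style sub-block: Conv -> BN -> h-swish
block = nn.Sequential(
    nn.Conv2d(16, 64, kernel_size=1, bias=False),
    nn.BatchNorm2d(64),
    nn.Hardswish(),  # MobileNetV3 uses h-swish where MobileNetV2 uses ReLU6
)

bn_before_hswish = block[1]  # the BN that sits directly in front of the activation

# The pre-activation feature is the output of that BN, before Hardswish is applied
x = torch.randn(2, 16, 8, 8)
pre_activation = bn_before_hswish(block[0](x))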
Hi,
Thank you for your interest in my work.
I have never tried KD on MobileNetV3, so I'm not 100% sure.
But if you use MobileNetV3 as the student, I think the current implementation will be fine.
In my work, the activation, margin, and pre-ReLU features only concern the teacher network.
For the student, the connector layers produce pre-ReLU-like features based on BN.
So you don't need to worry about the details of the student's representation.
You can find the connector layer here:
overhaul-distillation/ImageNet/distiller.py, lines 15 to 27 (commit 76344a8)
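Roughly, the connector is a 1x1 convolution followed by BN that maps the student feature to the teacher's channel width. A minimal sketch of the idea (paraphrased here, not copied verbatim from distiller.py):

import torch.nn as nn

def build_feature_connector(t_channel, s_channel):
    # 1x1 conv to match the teacher's channel width, followed by BN,
    # so the connector output plays the role of a pre-ReLU (BN) feature
    # on the student side.
    connector = [
        nn.Conv2d(s_channel, t_channel, kernel_size=1, stride=1, padding=0, bias=False),
        nn.BatchNorm2d(t_channel),
    ]
    return nn.Sequential(*connector)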
In short, if you only change the student, I think there will be no problem.
However, performance improvement depends on many factors and training techniques.
So I'm not sure it will improve performance on MobileNetV3, since that model already adopts many recent regularization and training techniques.
Dear @bhheo,
Yes, sure. Thanks for the reply.
I found some mistakes during the integration with MobileNetV3.
I am also curious about one more thing:
in your get_bn_before_relu, for the fourth feature (bn4), why did you take a BN that comes after the ReLU?
def get_bn_before_relu(self):
    bn1 = self.features[4].conv[1]
    bn2 = self.features[7].conv[1]
    bn3 = self.features[14].conv[1]
    bn4 = self.features[-1].conv[-1]
    return [bn1, bn2, bn3, bn4]
If you take a look at bn4, it is not a BN before a ReLU: features[-1].conv[-1] resolves to the BatchNorm2d at index (7) in the module printed below, which has no ReLU after it. Do you have any explanation for that?
Thank you.
(17): InvertedResidual(
  (conv): Sequential(
    (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (1): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU6()
    (3): Conv2d(960, 960, kernel_size=(3, 3), stride=(1, 1), groups=960, bias=False)
    (4): BatchNorm2d(960, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU6()
    (6): Conv2d(960, 320, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (7): BatchNorm2d(320, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
Dear @hamhanry,
That is my mistake.
I intended to use the BN at the final conv layer of the ImageNet version
(such as https://github.com/tonylins/pytorch-mobilenet-v2/blob/99f213657e97de463c11c9e0eaca3bda598e8b3f/MobileNetV2.py#L99).
However, that layer was removed in the semantic segmentation version, so features[-1].conv[-1] points to a different BN with no ReLU after it.
Since MobileNetV2 was never used as a teacher in my paper, I didn't pay attention to that function and made a mistake.
Sorry for the confusion.
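If you need bn4 to point at a BN that is actually followed by an activation in this segmentation variant, one possible workaround (a suggestion based on the module printed above, not a verified fix) is to take the depthwise BN at index (4), which sits right before the ReLU6 at index (5):

def get_bn_before_relu(self):
    bn1 = self.features[4].conv[1]
    bn2 = self.features[7].conv[1]
    bn3 = self.features[14].conv[1]
    # features[-1].conv[-1] is the projection BN (index 7) with no ReLU after it;
    # conv[4] is the depthwise BN immediately followed by ReLU6 at index 5.
    bn4 = self.features[-1].conv[4]
    return [bn1, bn2, bn3, bn4]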