amaralibey/MixVPR

When I use ResNet18, should I manually modify the in_channels and out_channels of MixVPR in the source code?

wpumain opened this issue · 7 comments

When I use ResNet18, the source code already modifies the value of out_channels automatically in the ResNet class:

self.out_channels = out_channels // 2 if self.model.layer4 is None else out_channels

However, there is no code in MixVPR that automatically modifies in_channels and out_channels. Should I modify them manually in this case?

self.channel_proj = nn.Linear(in_channels, out_channels)
self.row_proj = nn.Linear(hw, out_rows)

I should change in_channels to 256 and set out_channels = hw = 400, while out_rows stays unchanged at 4, right?

Or is it wrong to set out_channels = hw = 400? I ask because of this line:

x = x.permute(0, 2, 1)

Hello @wpumain,
The backbone and aggregator components operate independently. Within the ResNet backbone, out_channels is adjusted according to the specified layers_to_crop parameter: when the last residual block is cropped, the channel dimension is halved (in the ResNet architecture, each residual block always has twice as many channels as the one before it).

The codebase has been structured such that neither the resnet.py nor mixvpr.py files need to be altered. Both modules can be easily configured from the main.py file.

When employing ResNet18 and cropping the final residual block as suggested in MixVPR, the resulting feature maps will have 256 channels. Additionally, if the input image is 320x320, the feature maps' spatial dimensions will be 20x20. The overall output of the cropped ResNet18 will therefore be 256x20x20.

When creating a MixVPR instance, you specify in_channels as 256, in_h as 20, and in_w as 20. These parameters define the input feature map dimensions for MixVPR. The remaining parameters configure the MixVPR architecture: mix_depth sets the number of Mixer blocks to use, and out_channels determines the channel-wise dimensionality reduction of the feature maps from in_channels to out_channels (in this case, from 256 to xxx). Additionally, out_rows specifies the projection of the flattened feature map's rows, which in this case are 20x20 = 400 in size. If you take a look at the MixVPR architecture figure, things will become clearer.
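To make the role of each projection concrete, here is a minimal shape-trace sketch (the Feature-Mixer blocks are omitted for brevity, so this is only an approximation of the actual mixvpr.py forward pass). It shows why out_channels reduces the 256 channels while the 400 flattened rows are reduced separately by row_proj:

import torch
import torch.nn as nn

B, in_channels, in_h, in_w = 1, 256, 20, 20
hw = in_h * in_w                                     # 400 flattened spatial positions ("rows")
out_channels, out_rows = 128, 4

channel_proj = nn.Linear(in_channels, out_channels)  # 256 -> 128
row_proj = nn.Linear(hw, out_rows)                   # 400 -> 4

x = torch.randn(B, in_channels, in_h, in_w)          # backbone output: (1, 256, 20, 20)
x = x.flatten(2)                                     # (1, 256, 400)
x = x.permute(0, 2, 1)                               # (1, 400, 256): put channels last
x = channel_proj(x)                                  # (1, 400, 128)
x = x.permute(0, 2, 1)                               # (1, 128, 400): put rows last
x = row_proj(x)                                      # (1, 128, 4)
print(x.flatten(1).shape)                            # final descriptor: (1, 512) = out_channels * out_rows

So out_channels should not be set to hw = 400; the permute you quoted only swaps the axes so that each Linear acts on the right dimension.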

The code in main.py will look like this:

model = VPRModel(
        #---- Encoder
        backbone_arch='resnet18',
        pretrained=True,
        layers_to_freeze=2,
        layers_to_crop=[4], # 4 crops the last resnet layer, 3 crops the 3rd, ...etc
        
        #---- Aggregator
        agg_arch='MixVPR',
        agg_config={
                'in_channels' : 256, # nb of channels in the MixVPR input feature maps
                'in_h' : 20, # height of the input feature maps
                'in_w' : 20, # width of the input feature maps
                'mix_depth' : 4,
                'mlp_ratio' : 1,
                'out_channels' : 128, # the channel wise reduction (could be any other value)
                'out_rows' : 4}, # the output dim will be (out_rows * out_channels)
                
        ...

It's worth noting that if you wish to use smaller input images, say 224x224, you can adjust the in_h and in_w parameters in MixVPR accordingly. Remember that when cropping ResNet at the 4th layer, the spatial dimensions are always reduced by a factor of 16. As a result, the feature maps produced for a 224x224 image will have a spatial dimension of 14x14. This value should be specified in MixVPR by setting in_h=14 and in_w=14.
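If you want to double-check those numbers, here is a quick sanity check that rebuilds the cropped ResNet18 with plain torchvision (this is not the repo's resnet.py, just an equivalent stack of its layers):

import torch
import torchvision

resnet = torchvision.models.resnet18(weights=None)        # weights=None just to skip the download
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,           # layer4 is cropped
)

with torch.no_grad():
    print(backbone(torch.randn(1, 3, 320, 320)).shape)     # torch.Size([1, 256, 20, 20])
    print(backbone(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 256, 14, 14])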

Please let me know if there is anything that remains unclear.

Thank you very much for your help.

self.out_channels = self.out_channels // 2 if self.model.layer3 is None else self.out_channels

The self.out_channels in ResNet must be equal to the in_channels in MixVPR, but the above code line doesn't seem to work. I think maybe we should pass self.out_channels in ResNet to in_channels in agg_config so that we don't have to manually modify in_channels in agg_config.

'in_channels' : 256, # nb of channels in the MixVPR input feature maps

What does nb stand for?

Thank you very much for your help.

self.out_channels = self.out_channels // 2 if self.model.layer3 is None else self.out_channels

The self.out_channels in ResNet must be equal to the in_channels in MixVPR, but the above code line doesn't seem to work. I think maybe we should pass self.out_channels in ResNet to in_channels in agg_config so that we don't have to manually modify in_channels in agg_config.

It's the other way around: we first take a ResNet backbone and feed its output to MixVPR. It's always up to MixVPR to adapt to the output of the backbone; we don't need to modify the code in resnet.py.

So, it's in_channels of MixVPR that must be equal to out_channels of ResNet.

The code you're referring to ensures that if we crop ResNet at the 3rd layer, the number of channels is divided by 4; if we crop at the 4th layer, it is divided by 2; otherwise it is kept at 2048 for ResNet-50/101/152 and 512 for ResNet-18/34.
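In other words, the rule can be sketched like this (illustrative only, not the exact code in resnet.py):

def cropped_out_channels(backbone_arch, layers_to_crop):
    base = 512 if backbone_arch in ('resnet18', 'resnet34') else 2048
    if 3 in layers_to_crop:
        return base // 4      # cropping the 3rd layer divides the channels by 4
    if 4 in layers_to_crop:
        return base // 2      # cropping the 4th layer divides the channels by 2
    return base

print(cropped_out_channels('resnet18', [4]))   # 256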

The reason in_channels must be set manually is a design choice we made so that this framework can train any aggregation technique (CosPlace, NetVLAD, GeM, ...etc). Each of these techniques uses a different name for in_channels, and we didn't want to alter their code to rename the parameter. If you want to make this automatic, a better way to do it is to add a parameter named in_channels to the helpers.get_aggregator() method, and change that function to take it into account when calling the aggregator.
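A possible sketch of that change in main.py (get_backbone, the keyword arguments, and the extra in_channels handling are assumptions here, not the current helpers.py API):

backbone = get_backbone(backbone_arch='resnet18', layers_to_crop=[4])    # hypothetical helper
agg_config = {'in_h': 20, 'in_w': 20, 'mix_depth': 4, 'mlp_ratio': 1,
              'out_channels': 128, 'out_rows': 4}
aggregator = helpers.get_aggregator(
    agg_arch='MixVPR',
    agg_config={**agg_config, 'in_channels': backbone.out_channels},     # taken from the backbone
)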

I will add these changes in the next sync.

'in_channels' : 256, # nb of channels in the MixVPR input feature maps

What does nb stand for?

nb means number, so: the number of channels in the MixVPR input feature maps

Thank you very much for your detailed guidance