ziplab/LITv2

Question about the code difference between the classification and segmentation

Leiyi-Hu opened this issue · 6 comments

Hi, thanks for your excellent work; it's a really elegant way to accelerate ViTs. I have some questions about the code (the backbone part) for classification and segmentation: the two versions seem slightly different from each other. Do they really have different designs, or are they just different implementations of the same model?
Looking forward to your reply!

Hi Leiyi,

Thanks for your interest! In short, they are the same model (same design) but with minor adaptations for different tasks.

For example, in image classification the input resolution is usually fixed and we only care about the output logits. However, for dense prediction tasks such as segmentation, the code has been adapted to handle different image resolutions. In addition, we add some normalization layers in the backbone (see here) to further process the pyramid feature maps. Hope this answers your question.
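For illustration, per-stage normalization of pyramid features typically looks something like the following minimal sketch (the class name and dims here are my own assumptions for the example, not the exact LITv2 code; see the linked backbone file for the real implementation):

```python
import torch
import torch.nn as nn

class PyramidNorms(nn.Module):
    """Sketch: one LayerNorm per pyramid stage, applied to each
    stage's (B, H*W, C) output before it is passed to the decode head."""

    def __init__(self, dims=(96, 192, 384, 768)):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d) for d in dims)

    def forward(self, feats):
        # feats: list of (B, H*W, C) tensors, one per stage
        return [norm(x) for x, norm in zip(feats, self.norms)]

# Usage: normalize two pyramid levels with different channel widths.
norms = PyramidNorms(dims=(8, 16))
feats = [torch.randn(2, 4, 8), torch.randn(2, 4, 16)]
out = norms(feats)
```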

Best,
Zizheng

Hi Zizheng,
Thanks for your reply. You said that you updated the code to deal with different image resolutions, but I found that you seem to fix the input at 512, apparently just for ADE20K, as in the following:

```python
# absolute position embedding
input_resolution = [
    512 // patch_size,
    512 // patch_size,
]
```

(from `segmentation/mmseg/models/backbones/litv2.py`)
Also, I want to confirm whether only the normalization layers have been changed, or whether there are any other changes.

Best regards,
Leiyi

Hi Leiyi, note that we do not use absolute position embedding in LITv2, so please just ignore this line of code. Besides, for semantic segmentation on ADE20K, it is common practice to set the training image resolution to 512x512. However, for object detection on COCO, the input resolution is not fixed due to data augmentation. There are no other changes in the architecture, since we need to load the pretrained weights from ImageNet training; otherwise there would be a mismatch between architectures.

Best,
Zizheng

Hi Zizheng, thanks a lot! Maybe I'll change the resolution according to my dataset. Another question: you mentioned that a smaller alpha may be better for semantic segmentation. If the value of alpha is changed, or the local window size s is adjusted, will these modifications affect the loading of weights? Or, if I want to change these parameters, do I have to pretrain the model from scratch? (According to the code, once alpha is fixed, the linear projections for Lo-Fi and Hi-Fi are determined.)
Best,
Leiyi

Good question!

  1. Changing the local window size will not affect the model parameters, since average pooling and window partition do not require learnable parameters; you can still load the previous ImageNet-pretrained weights.
  2. Changing the value of alpha allocates a different number of heads to the Hi-Fi and Lo-Fi branches in the proposed HiLo attention, which results in a different computational graph. In this case, you will need to retrain the model on ImageNet from scratch with the new alpha.
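As a rough sketch of these two points (the function name and exact head-split rule below are my own assumptions for illustration, not the repo's exact code): pooling layers carry no weights, while the per-branch projection widths follow directly from the head split, so a new alpha changes weight shapes and the old checkpoint no longer matches.

```python
import torch.nn as nn

# Point 1: average pooling has no learnable parameters, so the window
# size used for downsampling can change without affecting the checkpoint.
pool_small = nn.AvgPool2d(kernel_size=2)
pool_large = nn.AvgPool2d(kernel_size=4)
assert len(list(pool_small.parameters())) == 0
assert len(list(pool_large.parameters())) == 0

# Point 2: alpha decides the head split between the Lo-Fi and Hi-Fi
# branches; the qkv projection widths follow from that split, so a
# different alpha yields differently shaped weight tensors.
def hilo_projections(dim, num_heads, alpha):
    head_dim = dim // num_heads
    l_heads = int(num_heads * alpha)   # Lo-Fi heads (illustrative rule)
    h_heads = num_heads - l_heads      # Hi-Fi heads get the remainder
    lofi_qkv = nn.Linear(dim, l_heads * head_dim * 3)
    hifi_qkv = nn.Linear(dim, h_heads * head_dim * 3)
    return lofi_qkv, hifi_qkv

# alpha=0.5 and alpha=0.25 produce incompatible projection shapes,
# which is why a new alpha requires retraining from scratch.
a_lofi, a_hifi = hilo_projections(dim=384, num_heads=12, alpha=0.5)
b_lofi, b_hifi = hilo_projections(dim=384, num_heads=12, alpha=0.25)
```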

To help further research, we have just uploaded ImageNet-pretrained LITv2-S checkpoints with different choices of alpha. You can find them here.

Cheers,
Zizheng

Thank you very much! Your reply is very detailed and helpful!
Best,
Leiyi