SHI-Labs/Neighborhood-Attention-Transformer

training from scratch with different size for height and width

mr17m opened this issue · 3 comments

mr17m commented

Hello,
I am interested in training one of your models from scratch on ImageNet-1K with a non-square input size, e.g. (352, 448), for a segmentation task, but I don't know how to pass such a size through your arguments. Specifically, in this line only one spatial size is requested, which suggests the default configuration assumes square inputs. How can I specify the height and width separately?
Also, could you please explain the difference between the '--img-size' and '--input-size' arguments?

On the other hand, as I mentioned, my plan is to train one of your models (NAT/DiNAT) on ImageNet-1K from scratch and afterward use the pretrained weights in my dense prediction task. Do you think that is a reasonable approach? If not, could you let me know what you would suggest?

Please let me know what you think.
Thank you

Thanks for your interest.

You can refer to our config files under classification/configs, for instance the fine-tuning config at 384x384.
If you intend to train on non-square images, please use --input-size (that is actually the difference: this argument lets you specify the full tensor shape, including the number of channels, instead of just a single integer).
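
To make the distinction concrete, here is a minimal sketch (not the repository's actual argument parser) of how timm-style training scripts typically define the two flags: --img-size takes a single integer for square inputs, while --input-size takes the full (channels, height, width) shape, so a non-square size like 352x448 can be passed.

```python
# Illustrative only: mirrors the usual timm-style argument definitions,
# not the exact code in classification/.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--img-size', type=int, default=None,
                    help='Square image size, e.g. 224')
parser.add_argument('--input-size', type=int, nargs=3, default=None,
                    metavar=('C', 'H', 'W'),
                    help='Full input shape, e.g. 3 352 448 for non-square training')

# e.g. launching with: --input-size 3 352 448
args = parser.parse_args(['--input-size', '3', '352', '448'])
print(args.input_size)  # [3, 352, 448]
```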

However, I cannot speak to training ImageNet at that resolution. Common practice is to train on ImageNet with square crops or resizes, and then fine-tune at your desired resolution. If you refer to our segmentation experiments, all pre-trained models were trained at either 224x224 or 384x384, but segmentation resolution is not necessarily square, and rarely that small.

The only thing that I'd encourage you to try is adjusting dilation values (if using DiNAT models). We have a whole section on that in the paper, where we show the effect of adjusting dilation values when transferring the model to a new task, especially when it has a larger resolution.

The gist of it is: all of our models follow the common convention of processing feature maps that are 1/4th, 1/8th, 1/16th, and 1/32nd of the original image resolution. On ImageNet at 224x224, that means the feature maps are 56x56, 28x28, 14x14, and 7x7 in the 4 levels of every variant, regardless of variant size. Given that we set the Neighborhood Attention kernel size to 7x7, the first level can be dilated up to 8 (because 8 x 7 <= 56), the second up to 4, the third up to 2, and the final level cannot be dilated (dilation = 1).
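
Here's a quick sketch of that arithmetic: the maximum dilation at each level is the largest d such that d * kernel_size still fits inside the feature map. The 1/4, 1/8, 1/16, 1/32 downsampling factors and kernel size 7 are as described above; applying it to a non-square input like 352x448 is just my own illustration of your use case.

```python
# Maximum dilation per level: largest d with d * kernel_size <= feature map size.
def max_dilations(height, width, kernel_size=7, factors=(4, 8, 16, 32)):
    dilations = []
    for f in factors:
        fh, fw = height // f, width // f          # feature map size at this level
        d = max(1, min(fh, fw) // kernel_size)    # largest d that fits the smaller side
        dilations.append(d)
    return dilations

print(max_dilations(224, 224))  # [8, 4, 2, 1] -- matches the 224x224 example above
print(max_dilations(352, 448))  # [12, 6, 3, 1] for a 352x448 input (illustration)
```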

In downstream tasks, however, we adjust the dilation values per-level, and sometimes even per-layer, because: A. more dilation results in more global context, and B. gradual changes in dilation value grow the receptive field more quickly.
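
As a rough illustration of point B, a stack of stride-1 local (dilated) windows grows its receptive field by (k - 1) * d per layer, the same rule as dilated convolutions. The dilation schedules below are made up for illustration only; see the paper and the configs under segmentation/ and mask2former/ for the actual values.

```python
# Receptive field of stacked stride-1 local windows with kernel k and dilations d_i:
# rf = 1 + sum((k - 1) * d_i). Schedules here are hypothetical, not from the paper.
def receptive_field(kernel_size, dilations):
    return 1 + sum((kernel_size - 1) * d for d in dilations)

constant = [1, 1, 1, 1]   # four layers, no dilation
gradual  = [1, 2, 4, 8]   # four layers, gradually increasing dilation

print(receptive_field(7, constant))  # 25
print(receptive_field(7, gradual))   # 91
```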

As a reference, I'd highly encourage checking out both our semantic segmentation experiments with UperNet (under segmentation/), and our instance, semantic, and panoptic segmentation experiments with Mask2Former (under mask2former/).

Let me know if you have more questions.

mr17m commented

Thank you so much for your comprehensive answer.

Closing this due to inactivity.