berniwal/swin-transformer-pytorch

about window-size

huixiancheng opened this issue · 9 comments

Dear Sir, thank you very much for your great work. I would like to ask if you have any suggestions on how to set the window size.
For a 224x224 input, a window size of 7 is reasonable because every stage's feature map is divisible by 7, but for other sizes, such as 768x768 in Cityscapes, 7 will undoubtedly raise an error, since 768 / 32 = 24. So the window setting looks very subtle.
The closest value that works is 8, but is the window size like a convolution kernel, where odd numbers work better?
Also, is it possible to set different window sizes at different stages? That seems useful for irregular image sizes.
Since the window size is a very critical hyperparameter that determines the receptive field and the amount of computation, I would like to request your opinion. Thanks!

Thank you very much for your interest. I think as well that the window size is a very interesting hyperparameter to play with, as it greatly influences the receptive field of the model. I assume they chose 7 because it lets them recover global self-attention in the last stage (as 224 / 32 = 7), and in general attention models are assumed to behave similarly to convolutions in the early layers, attending more locally there and more globally later (some interesting reads: https://arxiv.org/pdf/2010.11929.pdf, https://arxiv.org/pdf/1911.03584.pdf, https://arxiv.org/pdf/2103.10697v1.pdf).
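For example (a minimal sketch, not from the repo), checking the per-stage feature-map sides under the standard cumulative downscaling of 4, 8, 16, 32 shows why 7 works for a 224 input but not for 768, while 8 divides every stage of a 768 input:

```python
# Per-stage feature-map side lengths for the usual cumulative downscaling.
def stage_sides(img_size, factors=(4, 8, 16, 32)):
    return [img_size // f for f in factors]

for size, window in [(224, 7), (768, 7), (768, 8)]:
    sides = stage_sides(size)
    divisible = all(s % window == 0 for s in sides)
    print(f"input {size}, window {window}: stages {sides}, divisible: {divisible}")
# input 224, window 7: stages [56, 28, 14, 7], divisible: True
# input 768, window 7: stages [192, 96, 48, 24], divisible: False
# input 768, window 8: stages [192, 96, 48, 24], divisible: True
```

Note that for 224 the last stage is exactly 7x7, so a window of 7 there is global self-attention.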

I don't think that odd numbers are superior to even numbers in this case. For convolutions it is essential to have a symmetric region around each target pixel, but here there is no single target pixel: each window contains many target pixels at once, and with the odd window size they use, none of them has a symmetric region around it anyway (except the middle pixel). So I assume you can easily use an even window size and get similar results.

Different window sizes for different layers are certainly possible as long as each side of the feature input is divisible by that window size, and I think it is certainly worth trying different settings and playing with the receptive field. In my experiments so far (weather forecasting) I even get slightly better results when using a smaller window size, as long as it is still large enough to achieve a global receptive field over multiple layers.
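For illustration, a hypothetical per-stage configuration check (the repo currently takes a single window_size, so a per-stage list like this is an assumption, as are the candidate values):

```python
# Hypothetical per-stage window sizes for a 768x768 input; each stage's
# feature-map side must be divisible by that stage's window size.
img_size = 768
factors = (4, 8, 16, 32)        # cumulative downscaling per stage
window_sizes = (8, 8, 12, 24)   # candidate window size per stage

for factor, window in zip(factors, window_sizes):
    side = img_size // factor
    assert side % window == 0, f"stage side {side} not divisible by {window}"
    print(f"stage side {side}, window {window}: {side // window} windows per axis")
# The last stage has a single 24x24 window, i.e. global attention there.
```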

I hope this answers your questions and wish you great success with your further experiments.

Thank you for your quick answer. I will give it a try.
[screenshot from the paper]
Notice that in the paper the authors use LayerNorm after patch splitting; it looks like your code doesn't use it (not sure if correct).
The LN can be placed before or after the reshaping operation. What is the difference between LN on (B, C, H, W) and LN on (B, (H*W), C)? I have tested LN after reshaping versus reshaping after LN, and it seems to have a slight effect on the results in a downstream semantic segmentation task (not rigorous ablation experiments).


Hi, how can I apply this to semantic segmentation? Could I add you on QQ to discuss?

The LayerNorm should be defined here:
[code screenshot]
and is applied directly before the attention:
[code screenshot]
which should directly follow the patch splitting.
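For reference, a minimal sketch of the pre-norm pattern used in lucidrains-style implementations (the class name here is from vit-pytorch; the repo's exact naming may differ):

```python
import torch.nn as nn

class PreNorm(nn.Module):
    """Applies LayerNorm to the input before the wrapped module."""
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

# Wrapping the attention as PreNorm(dim, WindowAttention(...)) means the
# LayerNorm runs directly before the attention, right after patch embedding.
```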

In general, for a LayerNorm that normalizes over all three dimensions (channels C, height H, width W), I think it should not make a difference between (B, C, H, W) and (B, (H*W), C): the values covered by the last three dimensions in the first case and the last two dimensions in the latter are exactly the same, so the mean and std are as well.
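A quick numerical check of this (a standalone sketch using the functional layer_norm without affine parameters):

```python
import torch
import torch.nn.functional as F

B, C, H, W = 2, 3, 4, 4
x = torch.randn(B, C, H, W)

# Normalize over all of (C, H, W).
a = F.layer_norm(x, (C, H, W))

# The same values reshaped to (B, H*W, C), normalized over both trailing dims.
y = x.permute(0, 2, 3, 1).reshape(B, H * W, C)
b = F.layer_norm(y, (H * W, C))

# Identical statistics, so the results match after undoing the reshape.
print(torch.allclose(a, b.reshape(B, H, W, C).permute(0, 3, 1, 2), atol=1e-6))
# True
```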

However, for the implementation I adopted from https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py I assume it would make a difference, as we pass only the hidden dimension as the dimension to normalize over. When only the last dimension is given, for (B, C, H, W) you would normalize just over the width dimension W, while for (B, (H*W), C) you would normalize over the channel dimension C (see: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html). For the implementation it seems to make little difference, though, as we are still normalizing over the channel dimension (just over each of the (H*W) pixels independently), and this implementation seems to be widely adopted (https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py).
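For completeness, a sketch of the single-dimension case (affine parameters disabled for clarity): nn.LayerNorm with one integer normalizes only the trailing dimension, so the two layouts genuinely differ here:

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 3, 4, 5
x = torch.randn(B, C, H, W)

# On (B, C, H, W), nn.LayerNorm(W) computes statistics over the width only.
out_w = nn.LayerNorm(W, elementwise_affine=False)(x)

# On (B, H*W, C), nn.LayerNorm(C) normalizes the channels of each pixel
# independently -- the variant the vit-pytorch-style code uses.
y = x.permute(0, 2, 3, 1).reshape(B, H * W, C)
out_c = nn.LayerNorm(C, elementwise_affine=False)(y)
```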

Thank you very much for your answers!


Hi, how can I apply this to semantic segmentation? Could I add you on QQ to discuss?

Just follow SETR

Hi! Dear Sir!
Do you think a corresponding window size should be set for inputs with an extremely unbalanced aspect ratio?
For example, should the window size for a 224x1792 input be set to [7, 7x8]? Can you give me some reference on the code? Thank you very much.

Hey! The code is not currently suitable for non-square window sizes; you would have to change a few things. First, the window_size parameter would need to become, for example, a list with two entries for the x- and y-direction. You would then have to adjust this parameter in the WindowAttention as well, replacing every use of window_size that corresponds to the x-direction with the first entry and every use for the y-direction with the second. This also includes the create_mask and relative-distance functions, and note that the displacement now also has two directions to consider, so the CyclicShift needs to be adjusted for that too. In general, however, it should not be too difficult to make all these adjustments.
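A rough sketch of what those adjustments could look like (untested; names loosely follow the repo, generalized to a rectangular window (wh, ww) and displacement (dh, dw)):

```python
import torch
import torch.nn as nn

class CyclicShift(nn.Module):
    def __init__(self, displacement):          # displacement = (dh, dw)
        super().__init__()
        self.displacement = displacement

    def forward(self, x):                      # x: (B, H, W, C)
        return torch.roll(x, shifts=self.displacement, dims=(1, 2))

def get_relative_distances(window_size):
    wh, ww = window_size
    # All (row, col) coordinates inside one wh x ww window, row-major.
    coords = torch.tensor([[y, x] for y in range(wh) for x in range(ww)])
    # Pairwise differences: (wh*ww, wh*ww, 2); the relative position
    # embedding table would then need shape (2*wh - 1, 2*ww - 1).
    return coords[None, :, :] - coords[:, None, :]

def create_mask(window_size, displacement, upper_lower, left_right):
    wh, ww = window_size
    dh, dw = displacement
    mask = torch.zeros(wh * ww, wh * ww)
    if upper_lower:  # block attention across the vertical wrap-around
        mask[-dh * ww:, :-dh * ww] = float('-inf')
        mask[:-dh * ww, -dh * ww:] = float('-inf')
    if left_right:   # block attention across the horizontal wrap-around
        mask = mask.view(wh, ww, wh, ww)
        mask[:, -dw:, :, :-dw] = float('-inf')
        mask[:, :-dw, :, -dw:] = float('-inf')
        mask = mask.view(wh * ww, wh * ww)
    return mask
```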

I think it depends on whether you are willing to give up the global receptive field, because by making the window_size as large as [7, 7x8] you add a lot of extra computation: the complexity of the WindowAttention is quadratic in the number of tokens per window. If you can afford it computationally, it definitely could make sense; otherwise I think you already reach a global receptive field, with 6 layers in the third stage and 2 layers in the fourth stage (as they are using), while using a window_size of [7, 28]. For everything below that you would have to take the tradeoff between not being fully global and saving a lot of compute and memory.
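For intuition, a back-of-the-envelope comparison (assuming the cost per window grows with the square of the number of tokens in it):

```python
# Attention inside one window scales with (wh * ww) ** 2.
for wh, ww in [(7, 7), (7, 28), (7, 56)]:
    tokens = wh * ww
    print(f"window {wh}x{ww}: {tokens} tokens, attention matrix {tokens}x{tokens}")
# 7x56 has 8x the tokens of 7x7 and therefore ~64x the attention-matrix entries.
```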

Thank you, sir! 🙇