Meituan-AutoML/Twins

Two gvt.py files in your repository: what is the difference between them?

JackeyGHD1 opened this issue · 2 comments

I found there are two gvt.py files in your repository: one in the root directory and another in the segmentation directory. After carefully comparing the two files, I noticed that the group attention computation in the segmentation version is different: it applies an attn_mask, while the other does not. What is the role of attn_mask in the group attention computation? Why do you do this? And which gvt.py should I use for a segmentation task?

Thanks for your attention. For segmentation, we recommend using the one under the segmentation directory; the segmentation training script is given in README.md.
In fact, there are three gvt.py files: one in the root, and the other two in the segmentation and detection directories.
For classification, these three implementations are equivalent. We use the neat implementation for classification to make our method easy to understand, but you can also use the other two for classification.
It is also convenient that the other two implementations can directly use the pretrained backbone to handle downstream tasks.

However, for downstream tasks such as detection and segmentation, we recommend using the latter two, because we need padding or masking to handle the changing input resolution during training.
Note that you can choose either masking or padding for segmentation; in our experiments, both give very similar performance.
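For readers comparing the two files, here is a minimal sketch of the masking idea, assuming a PyTorch-style locally-grouped attention; the function and variable names below are illustrative and not the repository's exact code. The feature map is padded so its height and width become multiples of the group size, and an additive attn_mask keeps the padded tokens from contributing to the attention weights:

```python
import torch
import torch.nn.functional as F

def grouped_attention_with_mask(x, H, W, ws):
    # Sketch only (assumed shapes/names): locally-grouped self-attention where
    # padded tokens are masked out, so the input resolution need not be a
    # multiple of the group size ws.
    # x: (B, H*W, C) token sequence.
    B, N, C = x.shape
    x = x.view(B, H, W, C)

    # Pad right/bottom so the spatial size is divisible by the group size.
    pad_r = (ws - W % ws) % ws
    pad_b = (ws - H % ws) % ws
    x = F.pad(x, (0, 0, 0, pad_r, 0, pad_b))
    Hp, Wp = H + pad_b, W + pad_r

    # Validity map: 1 for real tokens, 0 for padded ones.
    valid = torch.zeros(1, Hp, Wp, device=x.device)
    valid[:, :H, :W] = 1

    # Partition tokens and validity map into non-overlapping ws x ws groups.
    x = x.view(B, Hp // ws, ws, Wp // ws, ws, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, -1, ws * ws, C)    # (B, G, ws*ws, C)
    valid = valid.view(1, Hp // ws, ws, Wp // ws, ws)
    valid = valid.permute(0, 1, 3, 2, 4).reshape(1, -1, ws * ws)  # (1, G, ws*ws)

    # Additive mask: large negative value wherever the key token is padding,
    # so padded positions receive (almost) zero attention weight.
    attn_mask = (1.0 - valid).unsqueeze(2) * -1e4                 # (1, G, 1, ws*ws)

    # Single-head attention inside each group (qkv projections omitted for brevity).
    q = k = v = x
    attn = (q @ k.transpose(-2, -1)) / C ** 0.5 + attn_mask       # (B, G, ws*ws, ws*ws)
    attn = attn.softmax(dim=-1)
    out = attn @ v                                                # (B, G, ws*ws, C)

    # Undo the grouping and crop away the padded border.
    out = out.view(B, Hp // ws, Wp // ws, ws, ws, C)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, Hp, Wp, C)
    return out[:, :H, :W].reshape(B, H * W, C)
```

The padding-only alternative mentioned above would simply skip the attn_mask and let the padded tokens take part in the attention, which, per the reply, performs very similarly in practice.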

Thank you for your reply!