taoyang1122/adapt-image-models

The dimension of vit_clip and vit_imagenet

wlsrick opened this issue · 2 comments

Hello, I want to ask about the input dimension before the multi-head attention in vit_clip.py versus vit_imagenet.py.
In vit_clip.py, the input shape before T-MSA is `t (b n) d`, but in vit_imagenet.py it is `(b n) t d`.
The paper's description, however, says it is (N+1) x T x D.
So which one is correct?
Thanks a lot.

Hi, they are the same. The difference is that the self-attention implementation differs between the CLIP model code and the ViT code: CLIP uses PyTorch's nn.MultiheadAttention, which by default expects sequence-first input (seq_len, batch, dim), while the attention in the ViT code is batch-first. In both cases the self-attention runs over the T dimension. You may check their implementation details.
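For later readers, here is a minimal sketch (my own illustration with einops, not the repo's exact code) of why the two layouts are equivalent: they carry the same tokens and differ only in the axis order the attention module expects.

```python
import torch
from einops import rearrange

# Hypothetical shapes for illustration: b = batch, n = spatial tokens
# (N+1 with the class token), t = frames, d = channels.
b, n, t, d = 2, 197, 8, 768
x = torch.randn(b * n, t, d)  # tokens grouped as (b n) t d

# vit_imagenet.py style: a batch-first attention treats (b n) as the
# batch axis, so attention runs over the t axis directly.
x_batch_first = x  # shape: (b*n, t, d)

# vit_clip.py style: nn.MultiheadAttention defaults to sequence-first
# (seq_len, batch, dim), so the same tokens are permuted to t (b n) d.
x_seq_first = rearrange(x, 'bn t d -> t bn d')  # shape: (t, b*n, d)

# Same data, different layout: undoing the permutation recovers the
# batch-first tensor exactly.
assert torch.equal(x_seq_first.permute(1, 0, 2), x_batch_first)
```

Either way, each spatial token attends over its own sequence of T frames; the rearrange only matches the layout convention of the attention module being used.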

OK. Got it. Thanks a lot~