Input dimension before T-MSA in vit_clip vs. vit_imagenet
wlsrick opened this issue · 2 comments
wlsrick commented
Hello, I want to ask about the input dimension before multi-head attention in vit_clip.py versus vit_imagenet.py.
In vit_clip.py, the input to T-MSA has shape `t (b n) d`, but in vit_imagenet.py the input to T-MSA has shape `(b n) t d`.
The paper describes it as (N+1) x T x D.
So which one is correct?
Thanks a lot.
taoyang1122 commented
Hi, they are the same. The difference is that the self-attention implementation differs between the CLIP model code and the ViT code, but both apply self-attention along the T dimension. You may check their implementation details.
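To illustrate the point, here is a minimal sketch (not code from this repo) of why the two layouts are equivalent. It assumes the usual conventions: CLIP-style attention blocks follow PyTorch's `nn.MultiheadAttention` default sequence-first layout `(seq_len, batch, dim)`, while timm-style ViT attention is batch-first `(batch, seq_len, dim)`. The names `b, n, t, d` follow the einops patterns from the question.

```python
import torch
from einops import rearrange

b, n, t, d = 2, 197, 8, 768          # batch, tokens (N+1), frames, dim
x = torch.randn(b, n, t, d)          # one T-length sequence per (image, token) pair

# vit_clip.py-style layout: sequence-first, as nn.MultiheadAttention expects
x_clip = rearrange(x, 'b n t d -> t (b n) d')

# vit_imagenet.py-style layout: batch-first, as timm-style Attention expects
x_vit = rearrange(x, 'b n t d -> (b n) t d')

# Both layouts place the same T tokens along the attended axis, so
# self-attention in either one mixes information across frames only.
assert torch.equal(rearrange(x_clip, 't bn d -> bn t d'), x_vit)
```

In both cases the attended sequence has length T, so the result matches the paper's (N+1) x T x D description; only the memory layout expected by each attention implementation differs.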
wlsrick commented
OK. Got it. Thanks a lot~