taoyang1122/adapt-image-models

The dimension of vit_clip and vit_imagenet

wlsrick opened this issue · 2 comments

Hello, I want to ask about the input dimension before the multi-head attention in vit_clip.py versus vit_imagenet.py.
In vit_clip.py, the input shape before T-MSA is `t (b n) d`, but in vit_imagenet.py it is `(b n) t d`.
The paper's description, however, says it is (N+1) x T x D.
So which one is correct?
Thanks a lot.

Hi, they are the same. The difference is that the self-attention implementation differs between the CLIP model code and the ViT code: CLIP uses PyTorch's nn.MultiheadAttention, which by default expects sequence-first input (seq_len, batch, dim), while the attention in the ViT code is batch-first. In both cases the self-attention runs over the T dimension. You may check their implementation details.
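For later readers, here is a minimal sketch (my own illustration with einops, not the repo's exact code) of why the two layouts are equivalent: they carry the same tokens and differ only in the axis order the attention module expects.

```python
import torch
from einops import rearrange

# Hypothetical shapes for illustration: b = batch, n = spatial tokens
# (N+1 with the class token), t = frames, d = channels.
b, n, t, d = 2, 197, 8, 768
x = torch.randn(b * n, t, d)  # tokens grouped as (b n) t d

# vit_imagenet.py style: a batch-first attention treats (b n) as the
# batch axis, so attention runs over the t axis directly.
x_batch_first = x  # shape: (b*n, t, d)

# vit_clip.py style: nn.MultiheadAttention defaults to sequence-first
# (seq_len, batch, dim), so the same tokens are permuted to t (b n) d.
x_seq_first = rearrange(x, 'bn t d -> t bn d')  # shape: (t, b*n, d)

# Same data, different layout: undoing the permutation recovers the
# batch-first tensor exactly.
assert torch.equal(x_seq_first.permute(1, 0, 2), x_batch_first)
```

Either way, each spatial token attends over its own sequence of T frames; the rearrange only matches the layout convention of the attention module being used.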

OK. Got it. Thanks a lot~