lucidrains/vit-pytorch

Multi-head attention part on ViT

andreYoo opened this issue · 0 comments

Can you confirm that the current implementation of multi-head attention is the same as in the original paper?

From the repository's code (vit.py, lines 55 and 56):
qkv = self.to_qkv(x).chunk(3, dim = -1)
q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = self.heads), qkv)

It seems that q, k, and v are each split into multiple smaller features (in test.py, the original 1024-D embedding features are separated into 16 features of 64-D each).
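For reference, here is a minimal shape check of the split described above, assuming dim = 1024, heads = 16, and dim_head = 64 as in test.py (the variable names are illustrative, not taken from the repo):

```python
import torch
from torch import nn
from einops import rearrange

batch, num_tokens, dim, heads = 2, 65, 1024, 16

# single fused q/k/v projection, mirroring the snippet above
to_qkv = nn.Linear(dim, dim * 3, bias=False)
x = torch.randn(batch, num_tokens, dim)

qkv = to_qkv(x).chunk(3, dim=-1)               # three tensors of shape (b, n, 1024)
q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=heads), qkv)

print(q.shape)  # torch.Size([2, 16, 65, 64]) -> 16 heads, each with 64-D features
```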

However, in the paper, instead of dividing the 1024-D features, processing the pieces, and combining them, the full 1024-D features are fed into each of the n attention heads (each with its own projection), and the head outputs are then concatenated.
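To make the comparison concrete, below is a minimal sketch (not from the repo) contrasting the two formulations: per-head projection matrices as the paper writes them, versus one fused projection that is split into heads afterwards. If the fused weight is laid out as the concatenation of the per-head weights, the two appear to give the same result, which is what I would like confirmed:

```python
import torch
from einops import rearrange

dim, heads, dim_head = 1024, 16, 64
x = torch.randn(2, 65, dim)

# Paper-style: each head i applies its own projection W_q^i of shape (dim, dim_head)
per_head_w = [torch.randn(dim, dim_head) for _ in range(heads)]
q_per_head = torch.stack([x @ w for w in per_head_w], dim=1)        # (b, h, n, d)

# Fused-style: one projection of shape (dim, heads * dim_head), split into heads afterwards
fused_w = torch.cat(per_head_w, dim=1)                              # (dim, 1024)
q_fused = rearrange(x @ fused_w, 'b n (h d) -> b h n d', h=heads)   # (b, h, n, d)

print(torch.allclose(q_per_head, q_fused, atol=1e-4))  # expected: True
```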

Can you confirm that the implemented multi-head attention is the same as in the paper?