Multi-Head Attention
Closed this issue · 1 comment
deepzlk commented
It seems that the Multi-Head Attention does not actually implement multiple heads (num_heads=8)?
Rubics-Xuan commented
# q, k, v: (B, num_heads, N, head_dim), where head_dim = C // num_heads
attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, num_heads, N, N)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
# (attn @ v): (B, num_heads, N, head_dim) -> transpose to (B, N, num_heads, head_dim),
# then reshape merges the heads back into the channel dimension C
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
Although the code contains no explicit concatenation, it does implement multi-head self-attention: the heads are kept as a separate tensor dimension throughout, and the final transpose-and-reshape merges them back into the channel dimension, which is equivalent to concatenating the per-head outputs.
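To see why the transpose-and-reshape is equivalent to concatenation, here is a minimal sketch (with NumPy and hypothetical shapes, not the repo's code) comparing the two on the same per-head outputs:

```python
import numpy as np

B, H, N, D = 2, 8, 4, 16   # batch, heads, tokens, per-head dim (hypothetical)
C = H * D                  # embedding dimension

rng = np.random.default_rng(0)
out = rng.standard_normal((B, H, N, D))  # stand-in for (attn @ v), one slice per head

# Repo-style merge: move heads next to the channel dim, then flatten them.
merged = out.transpose(0, 2, 1, 3).reshape(B, N, C)

# Explicit concatenation of the H head outputs along the channel dim.
concat = np.concatenate([out[:, h] for h in range(H)], axis=-1)

assert np.array_equal(merged, concat)  # identical results
```

For each token, the reshape lays out head 0's D channels, then head 1's, and so on, which is exactly what concatenating the heads produces.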