Multi-Head Attention
Closed this issue · 1 comment
deepzlk commented
It seems that the Multi-Head Attention does not actually implement multiple heads (num_heads=8)?
Rubics-Xuan commented
# q, k, v: (B, num_heads, N, head_dim), where head_dim = C // num_heads
attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, num_heads, N, N)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
# (attn @ v): (B, num_heads, N, head_dim) -> transpose to (B, N, num_heads, head_dim),
# then reshape merges the heads back into the channel dimension C
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
Although the code contains no explicit concatenation, it does implement multi-head self-attention: the heads are kept as a separate tensor dimension throughout, and the final transpose-and-reshape merges them back into the channel dimension, which is equivalent to concatenating the per-head outputs.
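To see why the transpose-and-reshape is equivalent to concatenation, here is a minimal sketch (with NumPy and hypothetical shapes, not the repo's code) comparing the two on the same per-head outputs:

```python
import numpy as np

B, H, N, D = 2, 8, 4, 16   # batch, heads, tokens, per-head dim (hypothetical)
C = H * D                  # embedding dimension

rng = np.random.default_rng(0)
out = rng.standard_normal((B, H, N, D))  # stand-in for (attn @ v), one slice per head

# Repo-style merge: move heads next to the channel dim, then flatten them.
merged = out.transpose(0, 2, 1, 3).reshape(B, N, C)

# Explicit concatenation of the H head outputs along the channel dim.
concat = np.concatenate([out[:, h] for h in range(H)], axis=-1)

assert np.array_equal(merged, concat)  # identical results
```

For each token, the reshape lays out head 0's D channels, then head 1's, and so on, which is exactly what concatenating the heads produces.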