huang-xx/STGAT

Output of different heads in multi-head attention is not averaged

sujithvemi opened this issue · 1 comment

In the adaptation of BatchMultiHeadAttention from xptree's implementation, the averaging over multiple heads is not implemented as in the original repository. That makes the following code, which squeezes the second dimension, impossible unless only a single head is used.

x, attn = gat_layer(x)
if i + 1 == self.n_layer:
    x = x.squeeze(dim=1)
else:
    x = F.elu(x.transpose(1, 2).contiguous().view(bs, n, -1))
    x = F.dropout(x, self.dropout, training=self.training)

Can you please explain if I am missing something here?
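
For reference, a minimal standalone sketch of the difference (the tensor sizes below are illustrative, not taken from the repository): averaging over the head dimension, as in the original implementation, works for any number of heads, whereas squeeze(dim=1) only removes the head dimension when it has size 1.

import torch

x = torch.randn(8, 4, 32, 16)     # (bs, n_head, n, f_out) with 4 heads

# Averaging over heads works for any head count.
print(x.mean(dim=1).shape)        # torch.Size([8, 32, 16])

# squeeze(dim=1) is a no-op when n_head > 1, so the extra head
# dimension would remain in the output.
print(x.squeeze(dim=1).shape)     # torch.Size([8, 4, 32, 16])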

@huang-xx Thank you for the code along with the paper.

@sujithvemi In the first 'Graph Attention Layer' there are 4 heads, and the aggregated features from each head are concatenated. In the second 'Graph Attention Layer' there is just 1 head, so the head dimension has size 1 and squeeze(dim=1) simply removes it; no averaging is needed.
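
To make the shape bookkeeping concrete, here is a minimal sketch using the head counts described above (the batch size, node count, and feature size are made up): with 4 heads, the first layer's per-head features are flattened into one vector per node by the transpose/view, and because the final layer has a single head, its head dimension has size 1 and squeeze(dim=1) just drops it.

import torch

bs, n, f_out = 8, 32, 16                     # illustrative sizes

# First GAT layer: 4 heads -> (bs, n_head, n, f_out); head outputs are concatenated
h1 = torch.randn(bs, 4, n, f_out)
h1 = h1.transpose(1, 2).contiguous().view(bs, n, -1)
print(h1.shape)                              # torch.Size([8, 32, 64])  i.e. 4 * f_out per node

# Final GAT layer: 1 head -> head dimension has size 1, so squeeze removes it
h2 = torch.randn(bs, 1, n, f_out)
print(h2.squeeze(dim=1).shape)               # torch.Size([8, 32, 16])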