Multihead attention implementation
Opened this issue · 1 comment
hash2430 commented
```python
# Concatenate context vector with input (most important)
result = t.cat([decoder_input, result], dim=-1)
```
Excuse me, I don't think I have seen the multihead outputs concatenated with the original input when doing self-attention.
Plus, you commented it as important. I guess I am missing something?
Do you mind if I ask which paper you referred to when implementing this part of multihead attention?
Wallart commented
Same question here.
I didn't see any reference to it in the Transformer TTS paper:
https://arxiv.org/abs/1809.08895
EDIT: It might be linked to "The multi-head attention can integrate the encoder hidden states in multiple perspectives and generate better context vectors" in section 3.6 of the paper. Not sure of my interpretation.
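For what it's worth, the pattern being discussed could be sketched roughly like this in PyTorch. This is only an illustration of the concatenation idea, not the repo's actual code; the module name `ConcatAttention` and all dimensions are placeholders I made up:

```python
import torch
import torch.nn as nn

class ConcatAttention(nn.Module):
    """Hypothetical sketch: multihead attention whose context vector is
    concatenated with the original decoder input before a projection,
    matching the quoted `t.cat([decoder_input, result], dim=-1)` line."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Projection sized for the [input ; context] concatenation
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, decoder_input, encoder_states):
        # context: (batch, tgt_len, dim) attended over the encoder states
        context, _ = self.attn(decoder_input, encoder_states, encoder_states)
        # Concatenate context vector with input, as in the snippet above
        result = torch.cat([decoder_input, context], dim=-1)
        return self.proj(result)

x = torch.randn(2, 5, 16)    # decoder queries
mem = torch.randn(2, 7, 16)  # encoder hidden states
out = ConcatAttention(16, 4)(x, mem)
print(out.shape)  # torch.Size([2, 5, 16])
```

The concatenation keeps the raw input alongside the attended context, so the projection can mix both, which may be what the "better context vectors" remark in section 3.6 is getting at.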