Regarding softmax used for attention implementation.
ashim95 opened this issue · 1 comment
ashim95 commented
Thank you very much for sharing your code.
When computing attention scores, you take the softmax over the whole sequence. For variable-length sequences, the softmax therefore also includes the scores at padded positions, which are zero rather than -inf. Even though multiplying their attention weights with the zero hidden states eventually masks their contribution, the softmax itself is still not correct: each padded position contributes exp(0) = 1 to the normalization, which diminishes the attention weights (alphas) of the non-padded time-steps.
To address this, something like a sparse softmax could help, or simply masking the padded positions (setting their scores to -inf) before the softmax, as in the sketch below.
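For concreteness, here is a rough sketch of what I mean (assuming PyTorch; the `masked_softmax` helper and the boolean `mask` convention are just for illustration, not your actual code):

```python
import torch
import torch.nn.functional as F

def masked_softmax(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Softmax over the last dim that ignores padded positions.

    scores: (batch, seq_len) raw attention scores
    mask:   (batch, seq_len) boolean, True at real tokens, False at padding
    """
    # Padded positions get -inf so they receive exactly zero probability,
    # instead of contributing exp(0) = 1 to the softmax denominator.
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1)

# Quick check: the last two positions are padding and get zero weight.
scores = torch.tensor([[2.0, 1.0, 0.0, 0.0]])
mask = torch.tensor([[True, True, False, False]])
print(masked_softmax(scores, mask))  # tensor([[0.7311, 0.2689, 0.0000, 0.0000]])
```

With this, the attention weights over the real tokens are normalized only among themselves, so padding no longer flattens the distribution.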
Thank you,