Regarding softmax used for attention implementation.
ashim95 opened this issue · 1 comment
ashim95 commented
Thank you very much for sharing your code.
When computing attention scores, you take the softmax over the whole sequence. For variable-length sequences, the softmax therefore also includes the scores at padded positions, which are zero rather than -inf. Even though multiplying their attention weights with the zero hidden states eventually masks their contribution, the softmax itself is still not correct: each padded position contributes exp(0) = 1 to the normalization, which diminishes the attention weights (alphas) of the non-padded time-steps.
To address this, something like a sparse softmax could help, or simply masking the padded positions (setting their scores to -inf) before the softmax, as in the sketch below.
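For concreteness, here is a rough sketch of what I mean (assuming PyTorch; the `masked_softmax` helper and the boolean `mask` convention are just for illustration, not your actual code):

```python
import torch
import torch.nn.functional as F

def masked_softmax(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Softmax over the last dim that ignores padded positions.

    scores: (batch, seq_len) raw attention scores
    mask:   (batch, seq_len) boolean, True at real tokens, False at padding
    """
    # Padded positions get -inf so they receive exactly zero probability,
    # instead of contributing exp(0) = 1 to the softmax denominator.
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1)

# Quick check: the last two positions are padding and get zero weight.
scores = torch.tensor([[2.0, 1.0, 0.0, 0.0]])
mask = torch.tensor([[True, True, False, False]])
print(masked_softmax(scores, mask))  # tensor([[0.7311, 0.2689, 0.0000, 0.0000]])
```

With this, the attention weights over the real tokens are normalized only among themselves, so padding no longer flattens the distribution.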
Thank you,