Question about the second formula of the article
Ceasar9999 opened this issue · 1 comments
Ceasar9999 commented
I found a mistake. Specifically, the second equation in your paper differs from your code. Equation 2 of the paper shows the mask being added to QK, but in your code you use the function 'masked_fill', which looks like it multiplies the mask with QK. Could you please explain?
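For reference, here is a minimal sketch of how the two forms can relate. It is not taken from the paper or the repository. It assumes the additive mask holds 0 at allowed positions and -inf at disallowed ones, and that the softmax runs over the last dimension. The function names, `scores`, and `keep` are hypothetical.

```python
import torch

def additive_mask_attention(scores, keep):
    # Additive form (assumed): add 0 where keep is True, -inf where it is False.
    mask = torch.zeros_like(scores).masked_fill(~keep, float("-inf"))
    return torch.softmax(scores + mask, dim=-1)

def masked_fill_attention(scores, keep):
    # Fill form: overwrite disallowed scores with -inf before the softmax.
    return torch.softmax(scores.masked_fill(~keep, float("-inf")), dim=-1)

# Toy QK scores and a boolean mask; every row keeps at least one position.
scores = torch.tensor([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
keep = torch.tensor([[True, False, True], [True, True, False]])

a = additive_mask_attention(scores, keep)
b = masked_fill_attention(scores, keep)
print(torch.allclose(a, b))
```

Under these assumptions both forms give the same attention weights, because a finite score plus -inf is -inf and a finite score plus 0 is unchanged. Whether this matches the repository's actual mask convention would need to be checked against the code.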
TemugeB commented
I have the same question. A mask row could be all -inf if everything was below the threshold. After softmax, such rows would be NaN, which breaks backpropagation. How should the mask be applied properly in this case?
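One possible workaround, sketched here as an assumption rather than as the repository's method, is to detect fully masked rows, give them finite placeholder scores so the softmax stays finite, and then zero their weights. The helper name `safe_masked_softmax` and the tensors below are hypothetical.

```python
import torch

def safe_masked_softmax(scores, keep):
    # Overwrite disallowed scores with -inf, as in the usual masked softmax.
    filled = scores.masked_fill(~keep, float("-inf"))
    # True for rows that keep at least one position.
    has_any = keep.any(dim=-1, keepdim=True)
    # Fully masked rows get finite placeholder scores so softmax does not yield NaN.
    filled = torch.where(has_any, filled, torch.zeros_like(filled))
    weights = torch.softmax(filled, dim=-1)
    # Zero the weights of fully masked rows so they contribute nothing downstream.
    return torch.where(has_any, weights, torch.zeros_like(weights))

scores = torch.tensor([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]], requires_grad=True)
keep = torch.tensor([[True, False, True], [False, False, False]])  # second row fully masked

# Naive version: the fully masked row becomes NaN after softmax.
naive = torch.softmax(scores.masked_fill(~keep, float("-inf")), dim=-1)

weights = safe_masked_softmax(scores, keep)
values = torch.tensor([1.0, 2.0, 3.0])
(weights * values).sum().backward()  # gradients stay finite
```

The design choice here is to keep NaN out of the forward pass entirely instead of replacing it afterwards, since a NaN produced inside the graph can still contaminate gradients even when its output is later discarded. Whether zero attention is the right behaviour for a fully masked row depends on the model.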