Why is the total similarity kept at 1 in the corresponding row of Si, which may lead to wrong modifications?
Closed this issue · 4 comments
jj-ccc commented
Hi author, it was nice to read such an interesting paper. I would like to ask: why is the total similarity kept at 1 in the corresponding row of Si? This may lead to wrong modifications.
w1oves commented
- Applying softmax ensures that the similarities in each row sum to 1.
- Some features do not need modification, meaning they are dissimilar to every token; in such cases, it would be reasonable for the total similarity of that row to be lower.
- After removing a token, the total similarity of each row can range from 0 to 1.
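
The three points above can be illustrated with a small sketch. This is not the repository's code: the logits are made-up numbers, and treating column 0 as the removed token is an assumption made only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity logits: 2 features (rows) x 4 tokens (columns).
logits = [
    [6.0, 0.1, -0.2, 0.3],  # dissimilar to tokens 1..3, so mass lands on token 0
    [0.0, 4.0, 0.5, 0.2],   # clearly similar to token 1
]

S = [softmax(row) for row in logits]

# Every full row sums to 1 because of softmax.
full_sums = [sum(row) for row in S]

# Assumption: the removed token is column 0. The similarity that is left
# in each row now sums to something between 0 and 1.
kept_sums = [sum(row[1:]) for row in S]

print(full_sums)
print(kept_sums)
```

In this sketch the first row keeps almost no similarity once column 0 is dropped, so that feature would be barely modified, while the second row keeps most of its similarity.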
whyandbecause commented
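Hello, I also noticed this issue. Based on your reply, I understand the intent behind this design, but I still have one small question: during training, how is a large value consistently assigned in Si to the features that do not need to be changed?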
w1oves commented
When all the other values are small, the remaining value becomes correspondingly larger, by the nature of softmax.
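
A minimal sketch of that softmax property, using made-up logits: one entry is held at a fixed logit (0.0 here, an assumption) while the others are pushed down. It only shows the arithmetic of softmax and does not model training; `first_token_weight` is a hypothetical helper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def first_token_weight(other_logit, n_other=3, first_logit=0.0):
    """Softmax weight of the first entry when all other logits equal other_logit."""
    row = [first_logit] + [other_logit] * n_other
    return softmax(row)[0]

# As the other logits shrink, the first entry absorbs the remaining mass.
weights = [first_token_weight(v) for v in (0.0, -2.0, -4.0, -8.0)]
for w in weights:
    print(round(w, 4))
```

The first entry's logit never changes, yet its weight approaches 1 as the other similarities become small.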