Doubly stochastic regularization coefficient
a7b23 opened this issue · 2 comments
a7b23 commented
When computing the value of doubly stochastic loss, why do you use the fraction 16/196? Shouldn't it be 1 going by the paper?
MenSanYan commented
@a7b23 Because the length of the input sequence is set to 16, attention is performed 16 times, producing 16 sets of alphas (weights) during one generation. Each set of alphas sums to 1 over the 196 spatial locations, so across 16 steps each location receives on average only 16/196 of the total attention mass; the target of the regularizer is scaled from 1 down to 16/196 to match that.
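To make the difference concrete, here is a minimal sketch (not the repo's actual code) comparing the paper's regularization target of 1 with the scaled target 16/196, assuming T = 16 decoding steps and L = 196 spatial locations (a 14x14 feature map), with random softmax-normalized alphas:

```python
import numpy as np

# Assumed setup: T = 16 decoding steps, L = 196 spatial locations.
T, L = 16, 196
lam = 1.0  # regularization coefficient (hypothetical value)

rng = np.random.default_rng(0)
# alphas[t, i]: attention weight on location i at step t; each row sums to 1.
logits = rng.normal(size=(T, L))
alphas = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Paper's doubly stochastic term: lam * sum_i (1 - sum_t alpha_{t, i})^2
paper_term = lam * ((1.0 - alphas.sum(axis=0)) ** 2).sum()

# Scaled variant: target T/L = 16/196, the average attention mass a
# location can receive when T rows of weights are spread over L locations.
repo_term = lam * ((T / L - alphas.sum(axis=0)) ** 2).sum()

print(paper_term, repo_term)
```

Since the per-location sums average T/L regardless of how attention is distributed, a target of 1 penalizes every location even for perfectly spread attention, while the scaled target only penalizes deviation from a uniform spread.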
nilinykh commented
@MenSanYan what would be the intuition behind it? I mean, I have not seen anyone using numbers other than 1 (as it is in the paper) anywhere. Could you maybe point me towards some sources I can look at?