Doubly stochastic regularization coefficient
a7b23 opened this issue · 2 comments
a7b23 commented
When computing the value of doubly stochastic loss, why do you use the fraction 16/196? Shouldn't it be 1 going by the paper?
MenSanYan commented
@a7b23 Because the length of the input sequence is set to 16, attention is performed 16 times, producing 16 sets of alphas (weights) during one generation. Each set of alphas sums to 1 over the 196 spatial locations, so across 16 steps each location receives on average only 16/196 of the total attention mass; the target of the regularizer is scaled from 1 down to 16/196 to match that.
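To make the difference concrete, here is a minimal sketch (not the repo's actual code) comparing the paper's regularization target of 1 with the scaled target 16/196, assuming T = 16 decoding steps and L = 196 spatial locations (a 14x14 feature map), with random softmax-normalized alphas:

```python
import numpy as np

# Assumed setup: T = 16 decoding steps, L = 196 spatial locations.
T, L = 16, 196
lam = 1.0  # regularization coefficient (hypothetical value)

rng = np.random.default_rng(0)
# alphas[t, i]: attention weight on location i at step t; each row sums to 1.
logits = rng.normal(size=(T, L))
alphas = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Paper's doubly stochastic term: lam * sum_i (1 - sum_t alpha_{t, i})^2
paper_term = lam * ((1.0 - alphas.sum(axis=0)) ** 2).sum()

# Scaled variant: target T/L = 16/196, the average attention mass a
# location can receive when T rows of weights are spread over L locations.
repo_term = lam * ((T / L - alphas.sum(axis=0)) ** 2).sum()

print(paper_term, repo_term)
```

Since the per-location sums average T/L regardless of how attention is distributed, a target of 1 penalizes every location even for perfectly spread attention, while the scaled target only penalizes deviation from a uniform spread.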
nilinykh commented
@MenSanYan what would be the intuition behind it? I mean, I have not seen anyone using numbers other than 1 (as it is in the paper) anywhere. Could you maybe point me towards some sources I can look at?