yunjey/show-attend-and-tell

Doubly stochastic regularization coefficient

a7b23 opened this issue · 2 comments

a7b23 commented

When computing the value of doubly stochastic loss, why do you use the fraction 16/196? Shouldn't it be 1 going by the paper?

@a7b23 Because the input sequence length is set to 16, the attention mechanism is performed 16 times, producing 16 sets of alphas (weights) during one generation procedure.
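To make the difference concrete, here is a minimal NumPy sketch (not the repo's actual TensorFlow code) contrasting the paper's target of 1 with the repo's target of 16/196. The shapes assume 16 decoding steps and 196 (14×14) spatial locations; since each step's alphas are a softmax over the 196 locations, the per-location attention totals sum to 16, and a perfectly uniform spread gives each location 16/196.

```python
import numpy as np

# Assumed shapes: T = 16 decoding steps, L = 196 (14x14) spatial locations.
T, L = 16, 196

# Fake attention weights: each row is a softmax over the L locations.
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, L))
alphas = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Per-location total attention across all T steps; sums to T overall.
coverage = alphas.sum(axis=0)  # shape (L,)

# Paper's form: each location should receive total weight ~1.
reg_paper = np.sum((1.0 - coverage) ** 2)

# Repo's form: each location should receive total weight ~T/L = 16/196,
# i.e. uniform coverage, since the T row-softmaxes force coverage.sum() == T.
reg_repo = np.sum((T / L - coverage) ** 2)
```

With T < L, a target of 1 per location is unreachable (the totals can only sum to T, not L), which is one plausible reading of why the repo rescales the target to T/L.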

@MenSanYan What would be the intuition behind it? I have not seen anyone use a value other than 1 (as in the paper). Could you point me towards some sources I could look at?