question about "doubly stochastic attention"
zym1010 opened this issue · 2 comments
zym1010 commented
As I'm reading the paper, I don't understand why, for the soft attention version, we encourage \sum_{t} a_{ti} \approx 1. Since the weights sum to 1 over locations at each timestep, \sum_{t,i} a_{ti} = C, so a target of C/L seems more appropriate.
kelvinxu commented
Hi, there is a note about this in the paper. What you suggest is correct; for the results we reported, though, we used 1. In our experience this didn't really change the results.
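For readers landing on this thread, here is a minimal NumPy sketch of the doubly stochastic penalty under both targets discussed above. The array name `alpha`, the shapes, and the random weights are illustrative only, not taken from the repo; the penalty form \lambda \sum_i (1 - \sum_t a_{ti})^2 follows the regularizer described in the paper.

```python
import numpy as np

# Illustrative setup: alpha[t, i] are soft-attention weights over L annotation
# locations at each of C decoding timesteps; each row sums to 1 by construction.
C, L = 16, 196                               # e.g. caption length, 14x14 feature locations
rng = np.random.default_rng(0)
alpha = rng.random((C, L))
alpha /= alpha.sum(axis=1, keepdims=True)    # normalize per timestep, like a softmax

per_location = alpha.sum(axis=0)             # \sum_t a_{ti}, one value per location i

# Penalty with target 1 for every location (what the paper reports using):
penalty_one = np.sum((1.0 - per_location) ** 2)

# Alternative target C / L suggested in the question: consistent with the global
# constraint \sum_{t,i} a_{ti} = C when attention is spread uniformly over locations.
penalty_cl = np.sum((C / L - per_location) ** 2)

print(penalty_one, penalty_cl)
```

Either target enters the training loss as an extra term scaled by a small coefficient; the thread above suggests the choice between 1 and C/L made little practical difference in the authors' experiments.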