Sigmoid in AttentionLSTM
bkj opened this issue · 0 comments
bkj commented
I noticed that you run the attention through a sigmoid because you were having numerical problems:
https://github.com/codekansas/keras-language-modeling/blob/master/attention_lstm.py#L54
This may work, but I think it should actually be a softmax. The paper you cite only says that the activation should be proportional to
exp(dot(m, U_s))
In another paper [1], they explicitly write it as a softmax:
softmax(dot(m, U_s))
which is just the expression above normalized over timesteps, since the softmax already applies the exponential.
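
To make the difference concrete, here is a minimal sketch in plain numpy, with made-up scores standing in for dot(m, U_s): the sigmoid squashes each score independently into (0, 1), so the resulting weights are not a distribution over timesteps, while the softmax gives weights proportional to exp(score) that sum to 1, which is what the proportionality in the paper implies. Subtracting the max before exponentiating is the usual trick for avoiding the overflow that presumably motivated the sigmoid in the first place.

```python
import numpy as np

# Hypothetical attention scores for one sequence of 5 timesteps,
# standing in for dot(m, U_s).
scores = np.array([2.0, -1.0, 0.5, 3.0, 0.0])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

sig_weights = sigmoid(scores)    # each weight squashed independently into (0, 1)
soft_weights = softmax(scores)   # weights proportional to exp(score), summing to 1

print("sigmoid weights:", sig_weights, "sum =", sig_weights.sum())
print("softmax weights:", soft_weights, "sum =", soft_weights.sum())
```

With the softmax, a high score on one timestep pushes the other weights down, which is the competitive behaviour attention is supposed to have; the sigmoid weights can all be close to 1 at once.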