julianser/hed-dlg-truncated

Loss / error scaling issues: Do we need to divide by the number of words and sentences?


Hi. First of all, thanks for sharing your work with us :)
I'd like to ask your opinion on the scaling of the loss and error in your model (the Hierarchical RNN Encoder-Decoder). I am referring to your paper at http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/11957/12160

Question 1) Which one is your objective function: equation (1) or equation (5)?

Question 2) Regarding the calculation of the loss / backpropagated error (from the decoder RNN to the context RNN), which scaling is reasonable? (A sketch of the three options follows the list.)
a. no scaling (triples with a large number of words tend to have high loss values)
b. scaling by the total number of words in the triple
c. scaling by the number of words in each target sentence
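
To make the three options concrete, here is a minimal sketch in NumPy. The names `token_nll` and `sentence_lengths` are illustrative, not variables from this repository:

```python
import numpy as np

# token_nll: 1-D array of per-token negative log-likelihoods for one triple,
#            with the target sentences concatenated in order.
# sentence_lengths: number of words in each target sentence of the triple.
def scaled_loss(token_nll, sentence_lengths, mode):
    if mode == "a":  # no scaling: triples with many words get large losses
        return token_nll.sum()
    if mode == "b":  # scale by the total number of words in the triple
        return token_nll.sum() / token_nll.size
    if mode == "c":  # scale each sentence by its own length, then sum
        loss, offset = 0.0, 0
        for n in sentence_lengths:
            loss += token_nll[offset:offset + n].sum() / n
            offset += n
        return loss
    raise ValueError("mode must be 'a', 'b', or 'c'")

# e.g. scaled_loss(np.array([2.3, 1.1, 0.7]), [2, 1], "c")
```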

Thanks for your opinion :)

Thanks!

The objective function optimized with stochastic gradient descent is the log-likelihood (i.e. the log of eq. (1) with respect to the data examples). A monotonic transformation of eq. (1) leads to eq. (5), so maximizing one is equivalent to maximizing the other.
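
In case it helps, the relationship can be written out in generic notation (assumed here; the paper's own symbols may differ): taking the log turns the product into a sum, and since the log is strictly increasing, both expressions share the same maximizer.

```latex
% Generic notation (assumed, not the paper's exact symbols): a dialogue
% D = (U_1, ..., U_M), each utterance U_m = (w_{m,1}, ..., w_{m,N_m}).
\begin{aligned}
P_\theta(U_1,\dots,U_M)
  &= \prod_{m=1}^{M} \prod_{n=1}^{N_m}
     P_\theta\!\left(w_{m,n} \mid w_{m,<n},\, U_{<m}\right) \\
\log P_\theta(U_1,\dots,U_M)
  &= \sum_{m=1}^{M} \sum_{n=1}^{N_m}
     \log P_\theta\!\left(w_{m,n} \mid w_{m,<n},\, U_{<m}\right)
\end{aligned}
```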

Regarding your question on scaling, I don't know which method will work better. The only scaling we are doing now is dividing by the number of examples in each batch, and that seems to work OK. It might be worth trying the other methods though.
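
For reference, the batch-level scaling described above amounts to something like the following sketch; the variable names are illustrative, not taken from this repository:

```python
import numpy as np

# batch_token_nll: (max_len, batch_size) per-token negative log-likelihoods
# mask:            (max_len, batch_size), 1.0 for real tokens, 0.0 for padding
def batch_loss(batch_token_nll, mask):
    batch_size = batch_token_nll.shape[1]
    # sum over all tokens in the batch, divide by the number of examples
    return (batch_token_nll * mask).sum() / batch_size
```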