orenmel/context2vec

Got all NaN after training 3 epochs

mfxss opened this issue · 11 comments

mfxss commented

I trained a [context2vec model](https://github.com/orenmel/context2vec/blob/master/context2vec/train/train_context2vec.py).
I printed context_v in explore_context2vec.py and got all NaN.
When using TensorFlow, I can fix this by adding a small number to the loss function:

cross_entropy = self.target * tf.log(self.prediction+1e-10)
How can I do this in Chainer?
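(For reference, a rough Chainer equivalent of that TensorFlow line could look like the sketch below; this assumes self.target and self.prediction are Chainer Variables, and it is only an illustration of the question, not code from context2vec.)

import chainer.functions as F

# add a small epsilon inside the log to avoid log(0)
cross_entropy = self.target * F.log(self.prediction + 1e-10)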

context2vec uses Chainer's built-in negative sampling cost function (not cross entropy), and this function should be stable. NaN values could be related to the configuration of the optimizer, though. I think you should consider configuring the Adam optimizer with a slower learning rate or trying a different optimizer (e.g. SGD).
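(As a minimal sketch of that suggestion, assuming model is the context2vec network and the values below are just examples, the optimizer could be configured in Chainer like this:)

from chainer import optimizers

# Adam with a smaller step size (alpha is Adam's base learning rate; Chainer's default is 0.001)
optimizer = optimizers.Adam(alpha=0.0005)
# or switch to plain SGD with an explicit learning rate:
# optimizer = optimizers.SGD(lr=0.1)
optimizer.setup(model)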

mfxss commented

What is your final loss? You didn't set a learning rate in your code, and it seems there is no direct way to set it in Chainer.

The loss function is chainer.links.NegativeSampling (a popular word2vec loss function). You can look at the code to see exactly how I use it.
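(Roughly, Chainer's negative sampling link is used as in the sketch below; the sizes and counts here are placeholders rather than context2vec's actual configuration, so see the repository code for the exact usage.)

import chainer.links as L

# word_counts: list of corpus frequencies, used to build the noise distribution
sampler = L.NegativeSampling(in_size=600, counts=word_counts, sample_size=10)
# context_vectors: (batch, in_size) float array; target_ids: (batch,) int array of word ids
loss = sampler(context_vectors, target_ids)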

mfxss commented

Yes, I read the source code. The learning rate needs to be set using this:

def lr(self):
    fix1 = 1. - self.beta1 ** self.t
    fix2 = 1. - self.beta2 ** self.t
    return self.alpha * math.sqrt(fix2) / fix1

You didn't set a learning rate in your code.
Is there another way to change it?
I added some information to the ukwac training data. After the first epoch the accuracy is 71.6%; after the second it drops to 70.6% on SE3. To keep the comparison with your model fair, I didn't change the optimizer or other parameters.

Will it help if I set a dropout ratio? My accum_loss/word is about 0.5 now. What was yours after 3 epochs?
Thank you very much.

I experimented a little with dropout early on and didn't see it helping much on the dev set so I didn't use it, but I didn't explore this option thoroughly. In your case, where performance seems to decrease with epochs, it does seem to make sense to try dropout.

If you're trying to reproduce my results, the first thing is to make sure that your setup is identical. Please double check that you followed the ukwac preprocessing described in the paper (lowercasing, all words with frequency less than 100 converted to UNK, etc.). Also, it would be good if you could share with me the exact command line you're using to train your model, so I can double check that you're using the same arguments as I did.

P.S. I already finished my PhD and don't currently have a setup for running context2vec, so I can't tell you my accum_loss/word.

EDIT: Actually, lowercasing of the corpus is done by default in context2vec, so no need to worry about that. The trimming of words with a frequency lower than 100 is done with the -t argument of train_context2vec.py.

mfxss commented

Thank you for your kindness.
I used this to train.
python context2vec/train/train_context2vec.py -i CORPUS_FILE.DIR -w WORD_EMBEDDINGS -m MODEL -c lstm --deep yes -t 100 --dropout 0.0 -u 300 -e 10 -p 0.75 -b 800 -g 0
I also added a POS tag to each word (e.g. 'apple_nn') to train the model; I did this to both the training and test files.
But the accuracy is a bit lower than yours. There is nothing wrong with the data reading process.
I changed the learning rate from 0.001 to 0.0005. Do you have any advice for me?

I used -b 1000, but that shouldn't make a difference. I also used max-sent-len=64 when running corpus_by_sent_length.py. Of all that, the addition of the POS tags seems to be the one factor that could be significant. I recommend that you try without it.

Hello! We are also having this problem (training on Gigaword with ≈100k vocab size); we're going to try switching from Adam to SGD for the optimizer, but can you think of any other reason for the numerical instability? From looking at the train_context2vec.py debugging output, it looks like at a certain point something goes haywire with the loss calculation:

...
1870001000 words, 113.03 sec, 8851.34 words/sec, 0.5178 accum_loss/word, 0.5133 cur_loss/word
1871000900 words, 113.64 sec, 8798.89 words/sec, 0.5178 accum_loss/word, 0.5067 cur_loss/word
1872001300 words, 114.65 sec, 8725.45 words/sec, 0.5178 accum_loss/word, 0.5145 cur_loss/word
1873001400 words, 113.60 sec, 8804.03 words/sec, 0.5178 accum_loss/word, 0.5002 cur_loss/word
1874001000 words, 112.24 sec, 8906.06 words/sec, 0.5177 accum_loss/word, 0.5073 cur_loss/word
1875001800 words, 113.57 sec, 8812.27 words/sec, 0.5177 accum_loss/word, 0.5154 cur_loss/word
1876000200 words, 112.69 sec, 8859.95 words/sec, 0.5177 accum_loss/word, 0.5180 cur_loss/word
1877001800 words, 113.26 sec, 8843.53 words/sec, 0.5177 accum_loss/word, 0.5158 cur_loss/word
1878001000 words, 112.73 sec, 8863.40 words/sec, 0.5177 accum_loss/word, 0.5075 cur_loss/word
1879002500 words, 113.53 sec, 8821.38 words/sec, 0.5177 accum_loss/word, 0.5180 cur_loss/word
1880002900 words, 113.68 sec, 8800.24 words/sec, 0.5178 accum_loss/word, 0.6031 cur_loss/word
1881000000 words, 112.80 sec, 8839.40 words/sec, 0.5180 accum_loss/word, 0.9049 cur_loss/word
1882002600 words, 113.07 sec, 8867.03 words/sec, 0.5196 accum_loss/word, 3.5249 cur_loss/word
1883000200 words, 112.69 sec, 8852.61 words/sec, 0.5209 accum_loss/word, 2.9450 cur_loss/word
1884000600 words, 113.57 sec, 8808.81 words/sec, 0.5218 accum_loss/word, 2.2366 cur_loss/word
1885001200 words, 113.18 sec, 8841.15 words/sec, 0.0000 accum_loss/word, nan cur_loss/word
1886004100 words, 113.48 sec, 8837.67 words/sec, 0.0000 accum_loss/word, nan cur_loss/word
1887001500 words, 112.52 sec, 8864.07 words/sec, 0.0000 accum_loss/word, nan cur_loss/word
1888000600 words, 112.87 sec, 8851.65 words/sec, 0.0000 accum_loss/word, nan cur_loss/word
...

Looking at the code, it looks like the NegativeSampling loss function is returning NaN when asked for its data on line 138 of train_context2vec.py, which sounds to me like either vanishing or exploding gradients. But I'm a bit surprised to see it happen this way: it seems like it should have taken a little longer to blow up (or vanish), and I'd also have expected some more dramatic fluctuations in the output numbers before hitting NaN.

Training happily continues for a few more days afterwards, but the resulting vectors are (unsurprisingly) full of NaNs. It happens at different points in the training (i.e., not always after the same number of words) so it doesn't seem to be caused by a malformed chunk of input, or anything like that, at least not as far as we can tell.
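(One way to avoid training for days on a dead model is to check the loss for NaN right after the forward pass and abort; a rough sketch, where loss stands for whatever Variable the training loop computes at that point in train_context2vec.py:)

import numpy as np

loss_value = float(loss.data)  # loss is a chainer.Variable
if np.isnan(loss_value):
    # abort (or at least skip this update) instead of silently continuing on NaNs
    raise RuntimeError('loss became NaN; stopping training')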

mfxss commented

@stevenbedrick I decreased the learning rate and got a better result. When you get NaN, the model is very likely overfitting, so you can stop training: you've reached a value so small it can no longer be represented. Hope my experience helps.

Yes. I'd suggest you try decreasing the learning rate. That can be done both for Adam and SGD.

OK, we're going to try turning Adam's alpha value down to 0.0005 and see if that gets us anywhere. Thanks for the tips!