carpedm20/lstm-char-cnn-tensorflow

Validation perplexity is 146.71 at the end of training (24 epochs)

ygoncharov opened this issue · 12 comments

(it should get ~82 on valid and ~79 on test)

$ python main.py --dataset ptb

.....

epoch: [24] [ 250/ 265] loss: 3.466149
Valid: loss: 5.225354, perplexity: 185.927017
{'perplexity': 83.749542031012467, 'epoch': 24, 'valid_perplexity': 146.71359295576036, 'learning_rate': 0.5}
[*] Saving checkpoints...
Test: loss: 4.836956, perplexity: 126.084908
[*] Test loss: 4.954320, perplexity: 141.786226
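
For readers following the numbers: each loss/perplexity pair printed in the log above is consistent with perplexity = exp(loss). The small check below only re-uses values copied from that log; it does not explain why `valid_perplexity` (146.71) differs from the `Valid:` line (185.93).

```python
import math

# (loss, perplexity) pairs copied from the log lines above
logged = [
    (5.225354, 185.927017),  # Valid
    (4.836956, 126.084908),  # Test
    (4.954320, 141.786226),  # Test loss (final line)
]

for loss, ppl in logged:
    # perplexity is the exponential of the average per-token loss (in nats)
    print(f"exp({loss}) = {math.exp(loss):.4f}  (logged: {ppl:.4f})")
```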

I'm working on this issue, and I don't think the current implementation differs from the original model. I checked the model's validity by comparing the losses of a single batch during the early epochs, and there are no differences. I also checked that the perplexity on the training set goes down to 90.

[Image: loss plot]

One thing I'm working on is changing the testing algorithm, which differs from the original. The original code calculates the perplexity over all of the test data in a single forward pass, but this repo calculates the test perplexity the same way as for the training data, as a batch-averaged perplexity. Changing this should reduce the perplexity somewhat, but I'm not sure it will make the results comparable.
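
To illustrate why the aggregation method matters, here is a minimal sketch, assuming "batch-averaged perplexity" means averaging exp(batch loss) over batches. The function names and the per-token losses are made up for illustration and are not taken from either codebase. Because exp is convex, the batch-averaged value can never be lower than the whole-corpus value.

```python
import math

def corpus_perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood over the whole set."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def batch_averaged_perplexity(token_nlls, batch_size):
    """Mean of per-batch perplexities (hypothetical reading of the repo's metric)."""
    batches = [token_nlls[i:i + batch_size]
               for i in range(0, len(token_nlls), batch_size)]
    ppls = [math.exp(sum(b) / len(b)) for b in batches]
    return sum(ppls) / len(ppls)

# Illustrative per-token losses: one easy batch, one hard batch
nlls = [4.0, 4.0, 6.0, 6.0]

whole = corpus_perplexity(nlls)               # exp(5.0)
averaged = batch_averaged_perplexity(nlls, 2)  # (exp(4.0) + exp(6.0)) / 2
print(f"whole-corpus: {whole:.2f}, batch-averaged: {averaged:.2f}")
```

The gap grows with the spread of per-batch losses; when every batch has the same loss, the two numbers coincide.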

If you find any other differences, feel free to share them with me 😄

Cool stuff!
I noticed on the README that you are using 100/150 hidden units for the small/large models respectively. I actually use 300/650 hidden units, so this might explain the difference in performance. Also, it seems like you are using RMSProp? I've found vanilla SGD with a starting learning rate of 1.0 (halved every time the perplexity does not improve on the dev set) to work much better than other optimization methods, including RMSProp.
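
The schedule described above can be sketched as follows. This is only one reading of it: "does not improve" is assumed to mean "not lower than the previous epoch's dev perplexity", and the function name and the perplexity values are hypothetical.

```python
def sgd_learning_rates(valid_ppls, start_lr=1.0, decay=0.5):
    """Return the learning rate in effect after each epoch's dev evaluation.

    Assumption: the rate is multiplied by `decay` whenever the dev perplexity
    fails to drop below the previous epoch's value.
    """
    lr = start_lr
    prev = None
    rates = []
    for ppl in valid_ppls:
        if prev is not None and ppl >= prev:
            lr *= decay  # no improvement on the dev set: halve the rate
        rates.append(lr)
        prev = ppl
    return rates

# Illustrative dev perplexities, not measured results
rates = sgd_learning_rates([200.0, 150.0, 151.0, 140.0, 140.0])
print(rates)  # [1.0, 1.0, 0.5, 0.5, 0.25]
```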

Hope this helps.

@yoonkim Hi! Thanks for sharing your great work; I enjoyed the paper very much! Actually, the README was an old one that I had forgotten to update (I've fixed it now), and the code already uses the same hidden units, optimizer, and decay that you mentioned.

Ah ok! A few other things it may be:

  • batch size
  • parameter initialization

Thanks! I'll dig into those things. What was the perplexity on the training set after training?

I think it should be a lot lower. I don't recall the numbers exactly but since the dataset is small and the model has a lot of capacity (even with dropout) training PPL should be well below 50.

@carpedm20 Hi,
Did you find any possible pointers on this issue of high test perplexity? I was trying to debug it, and any help would be appreciated.

yss4 commented

@carpedm20 Hello, thanks for sharing your code on GitHub. I also noticed that the problem of high perplexity on the PTB test set is still open. Have you had a chance to look into this issue, or do you have any pointers for fixing it? Thanks in advance.

@nileshkulkarni @yss4 No, I couldn't find the cause of the problem yet, and I'm not working on this project now. But if you find any odd code that differs from the original paper, please share it and I'll take a look at it.

@carpedm20 This implementation is NOT identical to the original.

Interested readers can have a look at my code here:
https://github.com/mkroutikov/tf-lstm-char-cnn
which does reproduce Yoon Kim's result in TF.

I ran the code yesterday and got an averaged validation PPL of 156.097 and an averaged test PPL of 149.565. So I am reading your code and the original. The first difference I found was the criterion: yours is CE while the original is NLL. Does it matter?
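
On the CE versus NLL question, a small numeric sketch may help. It assumes the original model feeds log-softmax outputs into its NLL criterion, which this thread does not confirm. Under that assumption, cross-entropy computed from raw logits and NLL computed from log-probabilities give the same value, so the criterion name alone would not explain the gap. The helper names and logits below are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]

def nll_from_log_probs(log_probs, target):
    """NLL criterion: expects log-probabilities as input."""
    return -log_probs[target]

def cross_entropy_from_logits(logits, target):
    """Cross-entropy criterion: applies the softmax normalization itself."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

logits = [2.0, -1.0, 0.5]  # illustrative scores for a 3-word vocabulary
target = 2

ce = cross_entropy_from_logits(logits, target)
nll = nll_from_log_probs(log_softmax(logits), target)
print(ce, nll)  # the two values agree up to floating-point error
```

The two would differ only if one side skipped the log-softmax or applied it twice, which is worth checking in both codebases.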

Thanks for sharing your code. I want to know how I can train a word-level model. I found that your code has settings like (use_char = True, use_word = False). Would setting 'use_word = True' do that? Looking forward to your answer, thank you.