sriniiyer/codenn

Can I continue the process if interrupted?

Closed this issue · 5 comments

I'm new to this field...
I was running my Java data on the model, and after 22 hours of running, the connection to my lab server failed...
How can I continue the process?

Besides, my Java training data is 69k lines, and my lab's GPU is an NVIDIA Tesla P4 (8 GB GDDR5). After 22 hours of training, the learning rate is still 0.32 (init = 0.5). Is that too slow?

Sorry, I'm really a beginner and may ask some stupid questions... Thanks for your help :)

Maybe you can save the model at different points within each epoch? That way, when your server disconnects, you can reload the saved model and continue training where you last stopped.
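The save-and-resume idea above can be sketched as follows. This is a minimal, framework-agnostic toy in Python (the real codenn model is Torch-based); the file path, parameter dict, and decay factor are all made up for illustration.

```python
import os
import pickle

CKPT_PATH = "checkpoint.pkl"  # hypothetical checkpoint file

def save_checkpoint(epoch, params, lr, path=CKPT_PATH):
    # Persist everything needed to resume: epoch counter, parameters, lr.
    with open(path, "wb") as f:
        pickle.dump({"epoch": epoch, "params": params, "lr": lr}, f)

def load_checkpoint(path=CKPT_PATH):
    # Return the last saved state, or None if training starts fresh.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None

# Toy training loop: resume from the last saved epoch if a checkpoint exists.
state = load_checkpoint()
start_epoch = state["epoch"] + 1 if state else 0
params = state["params"] if state else {"w": 0.0}
lr = state["lr"] if state else 0.5

for epoch in range(start_epoch, 5):
    params["w"] += lr   # stand-in for a real gradient update
    lr *= 0.9           # stand-in for learning-rate decay
    save_checkpoint(epoch, params, lr)
```

If the process is killed mid-run, rerunning the same script picks up from the last checkpoint instead of epoch 0. A real setup would save the optimizer state as well.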

I suggest you use https://github.com/OpenNMT/OpenNMT-py with a MeanEncoder. It should be exactly the same as this model and should give you very similar results. It also supports model saving, so you can continue where you left off if it crashes.

Well, somehow I continued it... I don't know if it was the right way, but it worked =.=
I interrupted it after 78 epochs, while the learning rate was still over 0.1, because the training accuracy had started to fall and stayed around 50% for several epochs, and my tutor told me it had already converged... It's still a long way from the target of 0.001.
The result wasn't bad, BLEU = 13.5 and METEOR = 8.3, but not as good as the paper reported.
Is it common or normal for training to converge while the learning rate is still high?
Or does it mean it hadn't converged yet, or that my code was wrong?...
Thanks for your patience:)

You don't have to wait for the lr to drop below 0.001. You should use the epoch with the best BLEU on the development set. I think I ran it for 80 epochs (check the paper).
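The model-selection rule above (pick the checkpoint with the best dev BLEU, regardless of the final learning rate) is just an argmax over per-epoch scores. A toy sketch in Python; the epoch numbers and BLEU values below are invented for illustration:

```python
# Hypothetical dev-set BLEU logged after each epoch during training.
dev_bleu = {74: 13.1, 75: 13.4, 76: 13.5, 77: 13.3, 78: 13.2}

# Select the checkpoint with the highest dev BLEU; the learning rate
# at that point is irrelevant to the choice.
best_epoch = max(dev_bleu, key=dev_bleu.get)
print(best_epoch, dev_bleu[best_epoch])
```

In practice this means saving a checkpoint per epoch (or keeping only the best-so-far) and reporting test results from that single selected checkpoint.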

Have you changed anything in the code or the dataset? If not, it should give results very similar to the paper's. I'd need more information to debug.

I used a Java dataset from my tutor, and she said the result was acceptable.
Thanks for your help :)