Train from logical error
hanskrupakar opened this issue · 6 comments
I am trying to resume training from a checkpoint file. Even though the log says the model was loaded, the perplexity restarts at weight-initialization level, and the translation accuracy when I use evaluate.lua also suggests that the model is simply reinitializing the vectors instead of loading them from the checkpoint.
Is this an issue with the API? What am I doing wrong?
.......
Epoch: 4, Batch: 11850/11961, Batch size: 16, LR: 0.1000, PPL: 2565.87, |Param|: 5479.77, |GParam|: 44.02, Training: 134/65/69 total/source/target tokens/sec
Epoch: 4, Batch: 11900/11961, Batch size: 16, LR: 0.1000, PPL: 2573.56, |Param|: 5480.11, |GParam|: 46.07, Training: 134/65/69 total/source/target tokens/sec
Epoch: 4, Batch: 11950/11961, Batch size: 16, LR: 0.1000, PPL: 2580.50, |Param|: 5480.42, |GParam|: 90.12, Training: 134/65/69 total/source/target tokens/sec
Train 2582.1220978721
Valid 2958.3082902242
saving checkpoint to demo-model_epoch4.00_2958.31.t7
Script started on Monday 24 October 2016 08:55:52 AM IST
hans@hans-Lenovo-IdeaPad-Y500:~/seq2seq-attn-master$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
using CUDA on GPU 1...
loading data...
done!
Source vocab size: 50004, Target vocab size: 150004
Source max sent len: 50, Target max sent len: 52
Number of additional features on source side: 0
Switching on memory preallocation
loading demo-model_epoch4.00_2958.31.t7...
Number of parameters: 84236504 (active: 84236504)
Epoch: 5, Batch: 50/11961, Batch size: 16, LR: 0.0500, PPL: 375825299.43, |Param|: 5407.84, |GParam|: 503.37, Training: 131/61/69 total/source/target tokens/sec
Epoch: 5, Batch: 100/11961, Batch size: 16, LR: 0.0500, PPL: 145308733.29, |Param|: 5407.19, |GParam|: 130.81, Training: 132/63/69 total/source/target tokens/sec
Epoch: 5, Batch: 150/11961, Batch size: 16, LR: 0.0500, PPL: 85249666.69, |Param|: 5406.86, |GParam|: 1190.36, Training: 133/64/69 total/source/target tokens/sec
I can't reproduce this on the latest revision.
- What are the command lines you used to start the training and to resume it?
- Did you do any changes to the code?
I didn't make any changes except specifying the epoch to start loading from. I have attached a log file containing the train and load-from commands, which are the same except that I specify the file to load from.
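For reference, the resume command I intended would be something like the following (using the -train_from and -start_epoch options of train.lua; the checkpoint name is taken from the log above, and the start epoch value is my assumption):
th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -train_from demo-model_epoch4.00_2958.31.t7 -start_epoch 5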
Something is not right. According to your log file, you always run the same command:
th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
Is it the case?
- If not, can you share the actual command lines you ran?
- If yes, make sure you don't have local modifications in your source code. The logs you are getting do not reflect this command.
I ran it again from the beginning after you said it was strange. Attached is the log file for that run, along with the train.lua and preprocess.py I used.
preprocess.py.docx
train.lua.docx
error.txt
It seems that AdaGrad does not play nicely with the train_from option at the moment. I would advise you to stick with the default SGD, which works well.
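For context, AdaGrad keeps a per-parameter accumulator of squared gradients that scales the step size; if that accumulator is not saved in the checkpoint and restored by train_from, resuming effectively restarts the optimizer state even though the weights are loaded, which is one plausible explanation for the exploding perplexity above. A minimal sketch of the idea (hypothetical helper, not the actual seq2seq-attn code):
-- AdaGrad update sketch: the accumulated squared gradients (state.sq) must be
-- serialized together with the model, otherwise the effective learning rate
-- resets on resume even though the parameters themselves were loaded.
local function adagrad_step(params, grads, state, lr)
   state.sq = state.sq or params:clone():zero()
   state.sq:addcmul(1, grads, grads)                        -- accumulate g^2
   params:addcdiv(-lr, grads, torch.sqrt(state.sq):add(1e-10))
   return state                                             -- needs to be checkpointed too
end
Plain SGD has no such extra state to restore, which is why sticking with the default avoids the problem.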
Also, please don't set your options within the code. It is error-prone and makes it harder for whoever might assist you to know what you are doing.
Will remember not to inline changes from now on.
I switched to SGD and train_from works as expected.
Thanks.