How to resume from a checkpoint?
Closed this issue · 4 comments
Hello, I want to resume training from a checkpoint. I tried setting opt.checkpoint=true, but got this error:
/root/torch/install/bin/luajit: train.lua:227: attempt to index global 'checkpoint' (a nil value)
stack traceback:
train.lua:227: in function 'hooks'
./engines/fboptimengine.lua:50: in function 'train'
train.lua:363: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004064f0
Not sure checkpointing works correctly right now; please use the retrain option while we fix checkpointing.
@szagoruyko, thank you.
I notice that in the logs dir there are:
transformer.t7, optimState_500.t7, model_500.t7
So I set:
retrain=model_500.t7
transformer=transformer.t7
But where do I set optimState_500.t7?
@northeastsquare there is no option for that; the momentum will be reset.
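To see why losing optimState matters, here is a minimal, self-contained sketch (plain Python, not this repo's code; all names are illustrative) of SGD with momentum on a 1-D quadratic loss. It compares training straight through against checkpointing only the weights and resuming with a zeroed momentum buffer, which is what happens when optimState_500.t7 is not reloaded:

```python
# Illustration only: resuming weights without the optimizer state
# resets the momentum buffer, so the trajectory diverges from
# uninterrupted training.

def sgd_momentum_step(w, v, lr=0.1, mu=0.9):
    g = 2.0 * w          # gradient of f(w) = w^2
    v = mu * v - lr * g  # momentum buffer update
    return w + v, v

# 10 steps, uninterrupted.
w, v = 1.0, 0.0
for _ in range(10):
    w, v = sgd_momentum_step(w, v)
w_continuous = w

# 5 steps, "checkpoint" only the weights, resume with v = 0.
w, v = 1.0, 0.0
for _ in range(5):
    w, v = sgd_momentum_step(w, v)
w, v = w, 0.0            # weights kept, momentum buffer lost
for _ in range(5):
    w, v = sgd_momentum_step(w, v)
w_reset = w

print(abs(w_continuous - w_reset) > 1e-6)  # the two runs differ
```

In practice the effect usually washes out after a few hundred iterations, but it is the reason a proper resume saves and reloads the optimizer state alongside the model.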
@szagoruyko OK