Resume ELMo training after crash
pjox opened this issue · 1 comment
Hello,
I'm currently trying to train ELMo on my own data, but unfortunately the process crashed (a cluster problem, nothing to do with the code). Since I have the checkpoints, I don't want to lose days of training. However, when I tried restart.py, the perplexity jumped way up, and it seems that training simply started reading the data from the beginning again. If I understood correctly, restart.py is intended for fine-tuning, not for resuming training after a crash. Then I noticed that in bilm/training.py, at line 675 where the train function is defined, one can pass a checkpoint:
def train(options, data, n_gpus, tf_save_dir, tf_log_dir,
          restart_ckpt_file=None):
and at line 770 of the same file, the checkpoint appears to be loaded (provided it is passed to the function):
if restart_ckpt_file is not None:
    loader = tf.train.Saver()
    loader.restore(sess, restart_ckpt_file)
However, in bin/train_elmo.py, where train is called on line 63, the checkpoint file is not specified:
train(options, data, n_gpus, tf_save_dir, tf_log_dir)
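For reference, here is a minimal sketch of how I imagine bin/train_elmo.py could be extended to forward a checkpoint to train. The --restart_ckpt_file flag is my own assumption, not something the repo currently provides; only the train signature is taken from bilm/training.py:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical extension of bin/train_elmo.py's argument parser:
    # an extra flag naming the checkpoint to resume from.
    parser = argparse.ArgumentParser()
    parser.add_argument('--save_dir', help='Location of checkpoint files')
    parser.add_argument('--restart_ckpt_file', default=None,
                        help='(assumed flag) checkpoint to resume training from')
    return parser.parse_args(argv)

# The parsed value would then be forwarded to train(), matching the
# signature quoted above from bilm/training.py line 675:
#   train(options, data, n_gpus, tf_save_dir, tf_log_dir,
#         restart_ckpt_file=args.restart_ckpt_file)
```

With this, passing --restart_ckpt_file would trigger the loader.restore branch in bilm/training.py, while omitting it keeps the current behavior (restart_ckpt_file stays None).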
Can I resume my training by simply passing the checkpoint as the last argument there? Do I have to do anything else to resume training? Is it even possible to resume training without affecting perplexity?
Thank you in advance.