allenai/bilm-tf

Resume ELMo training after crash

pjox opened this issue · 1 comment

pjox commented

Hello,

I'm currently trying to train ELMo on my own data, but unfortunately the process crashed (a cluster problem, nothing to do with the code). Since I have the checkpoints, I don't want to lose days of training. However, when I tried restart.py the perplexity jumped way up, and it seems to me that it simply started reading the data from the beginning again; if I understood correctly, restart.py is intended for fine-tuning, not for resuming training after a crash. Then I saw that in bilm/training.py, line 675, where the train function is defined, one can pass a checkpoint:

def train(options, data, n_gpus, tf_save_dir, tf_log_dir,
          restart_ckpt_file=None):

and at line 770 of the same file, the checkpoint appears to be loaded (provided it is passed to the function):

if restart_ckpt_file is not None:
    loader = tf.train.Saver()
    loader.restore(sess, restart_ckpt_file)

However, in bin/train_elmo.py, where the train function is called on line 63, the checkpoint file is not specified:

train(options, data, n_gpus, tf_save_dir, tf_log_dir)

Can I resume my training just by passing the checkpoint there at the end? Do I have to do something else to resume training? Is it even possible to resume training without affecting the perplexity?
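
In other words, would something like this be enough (the checkpoint path below is just a hypothetical example)?

train(options, data, n_gpus, tf_save_dir, tf_log_dir,
      restart_ckpt_file='/path/to/save_dir/model.ckpt-123456')  # hypothetical checkpoint prefix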

Thank you in advance.

@pjox Have you found a solution?

It seems we need to fix the code in bin/train_elmo.py by passing an explicit restart_ckpt_file argument to train.
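
Untested, but a minimal sketch of that change could look like the following, assuming the existing argparse setup in bin/train_elmo.py and introducing a new --restart_ckpt_file flag (the flag name is just a suggestion, not something in the current script):

    # inside main(args): forward the (possibly None) checkpoint path to train()
    train(options, data, n_gpus, tf_save_dir, tf_log_dir,
          restart_ckpt_file=args.restart_ckpt_file)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--save_dir', help='Location of checkpoint files')
    parser.add_argument('--vocab_file', help='Vocabulary file')
    parser.add_argument('--train_prefix', help='Prefix for train files')
    # new optional flag; when omitted, training starts from scratch as before
    parser.add_argument('--restart_ckpt_file', default=None,
                        help='Checkpoint to restore the model from before resuming training')
    args = parser.parse_args()
    main(args)

With a flag like that, resuming would presumably just be a matter of re-running the original training command with --restart_ckpt_file pointing at the latest checkpoint prefix in save_dir.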