Further Pre-training public checkpoint

Question

antonio-mastropaolo opened this issue 3 years ago · 0 comments

antonio-mastropaolo commented 3 years ago

Hello everyone!
I'm further pre-training the public checkpoint at https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/small/. However, after the first round of pre-training, I'm unable to resume from the last checkpoint.
In other words, the initial pre-training run works without any problems using the following command:

 model.train("pretraining", TRAIN_STEPS, init_checkpoint='gs://t5-data/pretrained_models/small')

When I restart the pre-training, I use:

 model.train("pretraining", TRAIN_STEPS, init_checkpoint=MODEL_DIR)

Here, MODEL_DIR contains the checkpoints written by the ongoing pre-training run.

Running this second command gives the following error:

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

NotFoundError Traceback (most recent call last)
NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

NotFoundError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py in restore(self, sess, save_path)
1318 # a helpful message (b/110263146)
1319 raise _wrap_restore_error_with_msg(
-> 1320 err, "a Variable name or other graph key that is missing")
1321
1322 # This is an object-based checkpoint. We'll print a warning and then do

NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
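To narrow this down, it may help to confirm which checkpoint MODEL_DIR actually points to. The following is a minimal sketch, not part of the T5 API: `latest_checkpoint_prefix` is a hypothetical helper, and it assumes MODEL_DIR is a local path (not a `gs://` bucket) containing the plain-text `checkpoint` state file that TensorFlow writes next to its checkpoints.

```python
import os
import re
from typing import Optional


def latest_checkpoint_prefix(model_dir: str) -> Optional[str]:
    """Hypothetical helper: read the latest checkpoint prefix from model_dir.

    Parses the text file named `checkpoint` that TensorFlow writes alongside
    its checkpoints. Returns None if the file or the entry is missing.
    Assumes model_dir is a local directory.
    """
    state_file = os.path.join(model_dir, "checkpoint")
    if not os.path.isfile(state_file):
        return None
    with open(state_file, encoding="utf-8") as f:
        for line in f:
            # Only the `model_checkpoint_path` entry names the latest checkpoint;
            # `all_model_checkpoint_paths` lines are skipped by this pattern.
            match = re.match(r'\s*model_checkpoint_path:\s*"(.*)"\s*$', line)
            if match:
                path = match.group(1)
                # Relative entries are resolved against the directory itself.
                return path if os.path.isabs(path) else os.path.join(model_dir, path)
    return None
```

Calling `latest_checkpoint_prefix(MODEL_DIR)` only shows which checkpoint prefix the directory records as the latest one; whether passing that prefix (rather than the directory) as `init_checkpoint` avoids the error above is untested.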

What am I missing here?
Thanks in advance for your help.