Further pre-training a public checkpoint
antonio-mastropaolo opened this issue · 0 comments
Hello everyone!
I'm further pre-training the following checkpoint: https://console.cloud.google.com/storage/browser/t5-data/pretrained_models/small/. However, after the first pre-training round, I'm not able to continue pre-training from the last checkpoint.
In other words, at the beginning I can run the pre-training without any problems, with the following command:
model.train("pretraining", TRAIN_STEPS, init_checkpoint = 'gs://t5-data/pretrained_models/small')
When I re-start the pre-training:
model.train("pretraining", TRAIN_STEPS, init_checkpoint = MODEL_DIR)
MODEL_DIR contains the checkpoints from the ongoing pre-training run.
By doing so, I'm getting the following error:
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
RuntimeError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
NotFoundError Traceback (most recent call last)
NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint
During handling of the above exception, another exception occurred:
NotFoundError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py in restore(self, sess, save_path)
1318 # a helpful message (b/110263146)
1319 raise _wrap_restore_error_with_msg(
-> 1320 err, "a Variable name or other graph key that is missing")
1321
1322 # This is an object-based checkpoint. We'll print a warning and then do
NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
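Since the message points at a variable name missing from the checkpoint, one way to narrow this down is to compare the variable names stored in the public checkpoint with those stored in MODEL_DIR. The following is only a minimal sketch: the helper names `checkpoint_variable_names` and `compare_variable_names` are hypothetical, and it assumes that both locations contain a `checkpoint` state file so that `tf.train.latest_checkpoint` can resolve them, and that TensorFlow can read `gs://` paths in this environment.

```python
def checkpoint_variable_names(checkpoint_dir):
    """Return the set of variable names in the latest checkpoint of a directory.

    Hypothetical helper. Assumes `checkpoint_dir` holds a `checkpoint` state
    file; TensorFlow is imported lazily so the comparison helper below can be
    used without it.
    """
    import tensorflow as tf

    latest = tf.train.latest_checkpoint(checkpoint_dir)
    if latest is None:
        raise FileNotFoundError(f"no checkpoint found in {checkpoint_dir}")
    # tf.train.list_variables yields (name, shape) pairs.
    return {name for name, _shape in tf.train.list_variables(latest)}


def compare_variable_names(pretrained_names, ongoing_names):
    """Report variable names that appear in only one of the two checkpoints."""
    pretrained, ongoing = set(pretrained_names), set(ongoing_names)
    return {
        "only_in_pretrained": sorted(pretrained - ongoing),
        "only_in_ongoing": sorted(ongoing - pretrained),
    }


# Possible usage (not run here; MODEL_DIR is the directory from the post):
# report = compare_variable_names(
#     checkpoint_variable_names("gs://t5-data/pretrained_models/small"),
#     checkpoint_variable_names(MODEL_DIR),
# )
# print(report)
```

If the two name sets differ, that would at least show which keys the restore step cannot find; if they match, the problem is more likely in how the checkpoint path is passed than in the checkpoint contents.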
What is it that I am missing here?
Thanks in advance for your help.