EleutherAI/gpt-neo

[colab notebooks] Can't restore pretrained weights

sky1ove opened this issue · 2 comments

When running the GPTNeo_example_notebook.ipynb notebook, it can't restore the pretrained weights.
Uploading to the bucket works fine. I also tried !gcloud auth application-default login, but it didn't help.
Here is my notebook that generates the error: https://colab.research.google.com/drive/1HDSVSsppKqwXHCZG9gwIb-VpgB2zusL3?usp=sharing

Restoring parameters from gs://sky1ove2/GPT3_2-7B/model.ckpt-400000
training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1375, in _do_call
return fn(*args)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://sky1ove2/GPT3_2-7B/model.ckpt-400000: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{
"error": {
"code": 403,
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"errors": [
{
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"domain": "global",
"reason": "forbidden"
}
]
}
}
'
when reading gs://sky1ove2/GPT3_2-7B
[[{{node save/RestoreV2_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 1304, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 968, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1369, in _do_run
run_metadata)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1394, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://sky1ove2/GPT3_2-7B/model.ckpt-400000: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{
"error": {
"code": 403,
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"errors": [
{
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"domain": "global",
"reason": "forbidden"
}
]
}
}
'
when reading gs://sky1ove2/GPT3_2-7B
[[node save/RestoreV2_1 (defined at /content/GPTNeo/model_fns.py:248) ]]

Original stack trace for 'save/RestoreV2_1':
File "main.py", line 257, in <module>
main(args)
File "main.py", line 251, in main
estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=params["train_steps"])
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3105, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
self.config)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2942, in _call_model_fn
config)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3233, in _model_fn
_train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3697, in _train_on_tpu_system
device_assignment=ctx.device_assignment)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/tpu.py", line 1826, in split_compile_and_shard
xla_options=xla_options)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/tpu.py", line 1492, in split_compile_and_replicate
outputs = computation(*computation_inputs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3683, in multi_tpu_train_steps_on_single_shard
inputs=[0, _INITIAL_LOSS])
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/training_loop.py", line 186, in while_loop
condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
return_same_structure)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2298, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2223, in _BuildLoop
body_result = body(packed_vars_for_body)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/training_loop.py", line 129, in body_wrapper
outputs = body(
(inputs + dequeue_ops))
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3682, in <lambda>
lambda i, loss: [i + 1, single_tpu_train_step(i)],
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1782, in train_step
self._call_model_fn(features, labels))
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2072, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/content/GPTNeo/model_fns.py", line 248, in model_fn
save_relative_paths=True)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 836, in __init__
self.build()
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 848, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 886, in _build
build_restore=build_restore)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 510, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 389, in _AddShardedRestoreOps
name="restore_shard"))
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 336, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1493, in restore_v2
name=name)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3565, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 257, in <module>
main(args)
File "main.py", line 251, in main
estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=params["train_steps"])
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3110, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.7/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3105, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1510, in _train_with_estimator_spec
save_graph_def=self._config.checkpoint_save_graph_def) as mon_sess:
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/monitored_session.py", line 605, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1039, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/monitored_session.py", line 903, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/monitored_session.py", line 670, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/session_manager.py", line 321, in prepare_session
config=config)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/session_manager.py", line 251, in _restore_checkpoint
sess, saver, ckpt.model_checkpoint_path)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/session_manager.py", line 71, in _restore_checkpoint_and_maybe_run_saved_model_initializers
saver.restore(sess, path)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 1340, in restore
err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:worker/replica:0/task:0:
Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://sky1ove2/GPT3_2-7B/model.ckpt-400000: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{
"error": {
"code": 403,
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"errors": [
{
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"domain": "global",
"reason": "forbidden"
}
]
}
}
'
when reading gs://sky1ove2/GPT3_2-7B
[[node save/RestoreV2_1 (defined at /content/GPTNeo/model_fns.py:248) ]]

Original stack trace for 'save/RestoreV2_1':
File "main.py", line 257, in <module>
main(args)
File "main.py", line 251, in main
estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=params["train_steps"])
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3105, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
self.config)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2942, in _call_model_fn
config)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3233, in _model_fn
_train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3697, in _train_on_tpu_system
device_assignment=ctx.device_assignment)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/tpu.py", line 1826, in split_compile_and_shard
xla_options=xla_options)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/tpu.py", line 1492, in split_compile_and_replicate
outputs = computation(*computation_inputs)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3683, in multi_tpu_train_steps_on_single_shard
inputs=[0, _INITIAL_LOSS])
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/training_loop.py", line 186, in while_loop
condition_wrapper, body_wrapper, inputs, name="", parallel_iterations=1)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
return_same_structure)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2298, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2223, in _BuildLoop
body_result = body(packed_vars_for_body)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/tpu/training_loop.py", line 129, in body_wrapper
outputs = body(
(inputs + dequeue_ops))
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3682, in <lambda>
lambda i, loss: [i + 1, single_tpu_train_step(i)],
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1782, in train_step
self._call_model_fn(features, labels))
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2072, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/content/GPTNeo/model_fns.py", line 248, in model_fn
save_relative_paths=True)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 836, in __init__
self.build()
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 848, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 886, in _build
build_restore=build_restore)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 510, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 389, in _AddShardedRestoreOps
name="restore_shard"))
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 336, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py", line 583, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1493, in restore_v2
name=name)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 3565, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)

I have the same error

Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://my_bucket/GPT3_XL/model.ckpt-362000: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{
"error": {
"code": 403,
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"errors": [
{
"message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.list access to the Google Cloud Storage bucket.",
"domain": "global",
"reason": "forbidden"
}
]
}
}

In my bucket, under Permissions, I added service-495559152420@cloud-tpu.iam.gserviceaccount.com and gave it the Storage Legacy Object Owner role, and it's working. I just hope I'm not actually screwing it up more somehow.
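
For anyone who would rather do this from a notebook cell than through the console, here is a rough sketch of the same idea. It pulls the service account and the missing permission out of the 403 message above and builds a `gsutil iam ch` command for it. The helper names `parse_gcs_403` and `grant_command` are made up for illustration, and the `objectViewer` role is my assumption rather than the role used in the comment above: it covers `storage.objects.list` and `storage.objects.get`, which is what the restore step complains about. Writing new checkpoints back to the bucket would likely need a broader role such as `objectAdmin`.

```python
import re


def parse_gcs_403(error_text):
    """Return (service_account, permission) named in a GCS 403 message, or None.

    Hypothetical helper: it only understands the message wording shown in
    the traceback in this thread.
    """
    match = re.search(
        r"([\w.-]+@[\w.-]+\.gserviceaccount\.com) does not have "
        r"(storage\.[\w.]+) access",
        error_text,
    )
    return (match.group(1), match.group(2)) if match else None


def grant_command(service_account, bucket, role="objectViewer"):
    """Build a `gsutil iam ch` command that grants `role` on `bucket`.

    The default role is an assumption (read-only access to objects);
    pass role="objectAdmin" if the job also has to write checkpoints.
    """
    return f"gsutil iam ch serviceAccount:{service_account}:{role} gs://{bucket}"


message = (
    "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have "
    "storage.objects.list access to the Google Cloud Storage bucket."
)
account, permission = parse_gcs_403(message)
# "my_bucket" is a placeholder; use your own bucket name.
print(grant_command(account, "my_bucket"))
```

In Colab the printed command can be run in a cell by prefixing it with `!`, after authenticating with an account that is allowed to change the bucket's IAM policy.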

This seems like a Google issue, not a GPT-Neo issue.