the `model_executor.py` example is broken
XMaster96 opened this issue · 0 comments
I tried to run the `model_executor.py` example, but I get a TPU device path mismatch. I have already read through the TF and MTF code to figure out for myself what is going on, but I could not find anything helpful. Maybe one of you could point me in the right direction.
Thanks for the help in advance.
Here are the errors:
INFO:tensorflow:Create CheckpointSaverHook.
I0122 17:16:08.101019 139623540565824 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
WARNING:tensorflow:From /home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0122 17:16:12.156689 139623540565824 deprecation.py:323] From /home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Starting the session.
I0122 17:16:12.169167 139623540565824 ops.py:5748] Starting the session.
WARNING:tensorflow:From /home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py:1475: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0122 17:16:12.264366 139623540565824 deprecation.py:323] From /home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/ops/array_ops.py:1475: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Graph was finalized.
I0122 17:16:12.353913 139623540565824 monitored_session.py:240] Graph was finalized.
Traceback (most recent call last):
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
self._extend_graph()
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation list_files/MatchingFiles: {{node list_files/MatchingFiles}} was explicitly assigned to /job:tpu_worker/task:0/device:CPU:0 but available devices are [ /job:worker/replica:0/task:0/device:CPU:0, /job:worker/replica:0/task:0/device:TPU:0, /job:worker/replica:0/task:0/device:TPU:1, /job:worker/replica:0/task:0/device:TPU:2, /job:worker/replica:0/task:0/device:TPU:3, /job:worker/replica:0/task:0/device:TPU:4, /job:worker/replica:0/task:0/device:TPU:5, /job:worker/replica:0/task:0/device:TPU:6, /job:worker/replica:0/task:0/device:TPU:7, /job:worker/replica:0/task:0/device:TPU_SYSTEM:0, /job:worker/replica:0/task:0/device:XLA_CPU:0, /job:worker/replica:0/task:1/device:CPU:0, /job:worker/replica:0/task:1/device:TPU:0, /job:worker/replica:0/task:1/device:TPU:1, /job:worker/replica:0/task:1/device:TPU:2, /job:worker/replica:0/task:1/device:TPU:3, /job:worker/replica:0/task:1/device:TPU:4, /job:worker/replica:0/task:1/device:TPU:5, /job:worker/replica:0/task:1/device:TPU:6, /job:worker/replica:0/task:1/device:TPU:7, /job:worker/replica:0/task:1/device:TPU_SYSTEM:0, /job:worker/replica:0/task:1/device:XLA_CPU:0, /job:worker/replica:0/task:2/device:CPU:0, /job:worker/replica:0/task:2/device:TPU:0, /job:worker/replica:0/task:2/device:TPU:1, /job:worker/replica:0/task:2/device:TPU:2, /job:worker/replica:0/task:2/device:TPU:3, /job:worker/replica:0/task:2/device:TPU:4, /job:worker/replica:0/task:2/device:TPU:5, /job:worker/replica:0/task:2/device:TPU:6, /job:worker/replica:0/task:2/device:TPU:7, /job:worker/replica:0/task:2/device:TPU_SYSTEM:0, /job:worker/replica:0/task:2/device:XLA_CPU:0, /job:worker/replica:0/task:3/device:CPU:0, /job:worker/replica:0/task:3/device:TPU:0, /job:worker/replica:0/task:3/device:TPU:1, /job:worker/replica:0/task:3/device:TPU:2, /job:worker/replica:0/task:3/device:TPU:3, /job:worker/replica:0/task:3/device:TPU:4, 
/job:worker/replica:0/task:3/device:TPU:5, /job:worker/replica:0/task:3/device:TPU:6, /job:worker/replica:0/task:3/device:TPU:7, /job:worker/replica:0/task:3/device:TPU_SYSTEM:0, /job:worker/replica:0/task:3/device:XLA_CPU:0 ]. Make sure the device specification refers to a valid device.
[[list_files/MatchingFiles]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "model_executor.py", line 590, in <module>
tf.app.run()
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "model_executor.py", line 586, in main
train_and_eval()
File "model_executor.py", line 575, in train_and_eval
_train_phase(mesh_context, config, resolver.get_master())
File "model_executor.py", line 449, in _train_phase
_run_train_phase()
File "model_executor.py", line 427, in _run_train_phase
config=config) as sess:
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
return self._sess_creator.create_session()
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
init_fn=self._scaffold.init_fn)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation list_files/MatchingFiles: node list_files/MatchingFiles (defined at /home/xmaster/neo_env/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) was explicitly assigned to /job:tpu_worker/task:0/device:CPU:0 but available devices are [ /job:worker/replica:0/task:0/device:CPU:0, /job:worker/replica:0/task:0/device:TPU:0, /job:worker/replica:0/task:0/device:TPU:1, /job:worker/replica:0/task:0/device:TPU:2, /job:worker/replica:0/task:0/device:TPU:3, /job:worker/replica:0/task:0/device:TPU:4, /job:worker/replica:0/task:0/device:TPU:5, /job:worker/replica:0/task:0/device:TPU:6, /job:worker/replica:0/task:0/device:TPU:7, /job:worker/replica:0/task:0/device:TPU_SYSTEM:0, /job:worker/replica:0/task:0/device:XLA_CPU:0, /job:worker/replica:0/task:1/device:CPU:0, /job:worker/replica:0/task:1/device:TPU:0, /job:worker/replica:0/task:1/device:TPU:1, /job:worker/replica:0/task:1/device:TPU:2, /job:worker/replica:0/task:1/device:TPU:3, /job:worker/replica:0/task:1/device:TPU:4, /job:worker/replica:0/task:1/device:TPU:5, /job:worker/replica:0/task:1/device:TPU:6, /job:worker/replica:0/task:1/device:TPU:7, /job:worker/replica:0/task:1/device:TPU_SYSTEM:0, /job:worker/replica:0/task:1/device:XLA_CPU:0, /job:worker/replica:0/task:2/device:CPU:0, /job:worker/replica:0/task:2/device:TPU:0, /job:worker/replica:0/task:2/device:TPU:1, /job:worker/replica:0/task:2/device:TPU:2, /job:worker/replica:0/task:2/device:TPU:3, /job:worker/replica:0/task:2/device:TPU:4, /job:worker/replica:0/task:2/device:TPU:5, /job:worker/replica:0/task:2/device:TPU:6, /job:worker/replica:0/task:2/device:TPU:7, /job:worker/replica:0/task:2/device:TPU_SYSTEM:0, /job:worker/replica:0/task:2/device:XLA_CPU:0, /job:worker/replica:0/task:3/device:CPU:0, /job:worker/replica:0/task:3/device:TPU:0, /job:worker/replica:0/task:3/device:TPU:1, /job:worker/replica:0/task:3/device:TPU:2, 
/job:worker/replica:0/task:3/device:TPU:3, /job:worker/replica:0/task:3/device:TPU:4, /job:worker/replica:0/task:3/device:TPU:5, /job:worker/replica:0/task:3/device:TPU:6, /job:worker/replica:0/task:3/device:TPU:7, /job:worker/replica:0/task:3/device:TPU_SYSTEM:0, /job:worker/replica:0/task:3/device:XLA_CPU:0 ]. Make sure the device specification refers to a valid device.
[[list_files/MatchingFiles]]
Note: TF version == 1.15.4
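As far as I can tell from the message, the requested device uses the job name `tpu_worker`, while every device in the available list uses the job name `worker`. Below is a minimal sketch, using only the Python standard library, that pulls those two pieces out of such a message so the mismatch is easy to see. The helper name `summarize_device_mismatch` is hypothetical and not part of TF or MTF, and the sample string is an abbreviated copy of the error above.

```python
import re


def summarize_device_mismatch(message):
    """Parse a TensorFlow 'Cannot assign a device' message.

    Returns (requested_device, requested_job, available_jobs).
    Hypothetical helper for illustration only; not a TF or MTF API.
    """
    requested = re.search(r"explicitly assigned to (\S+)", message)
    if requested is None:
        raise ValueError("no 'explicitly assigned to' clause found")
    requested_device = requested.group(1)

    # The job name is the '/job:<name>' component of the device string.
    job_match = re.search(r"/job:([^/]+)", requested_device)
    requested_job = job_match.group(1) if job_match else None

    # Collect the distinct job names from the bracketed device list.
    listed = re.search(r"available devices are \[(.*?)\]", message, re.S)
    available = re.findall(r"/job:([^/\s,]+)", listed.group(1)) if listed else []
    return requested_device, requested_job, sorted(set(available))


# Abbreviated copy of the error message from the traceback above.
sample = (
    "Cannot assign a device for operation list_files/MatchingFiles: "
    "{{node list_files/MatchingFiles}} was explicitly assigned to "
    "/job:tpu_worker/task:0/device:CPU:0 but available devices are [ "
    "/job:worker/replica:0/task:0/device:CPU:0, "
    "/job:worker/replica:0/task:0/device:TPU:0 ]. "
    "Make sure the device specification refers to a valid device."
)

device, job, jobs = summarize_device_mismatch(sample)
print("requested device:", device)
print("requested job:   ", job)
print("available jobs:  ", jobs)
```

This only makes the symptom explicit; it does not say where the `tpu_worker` job name is set in the example.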