remicres/sr4rs

How to use "--load_ckpts"?

Hello, I have a question that is probably silly, but I'll ask it anyway. I'm training my own super-resolution model on my satellite images on my PC. I had to stop the process, and when I tried to restart it from the checkpoint files with "--load_ckpts", it would not resume. Can you tell me what argument "--load_ckpts" should take? When I try with my last checkpoint files (index, meta, or DATA-0000-OF-000001), here is the error I get:

(DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): NOT_FOUND: Tensor name "apply_dis_gradients/beta1_power" not found in checkpoint files results/ckpts/SR4RS_E120_B4_LR0.0002_Gan1.0_L1-200.0_VGG3e-05_VGGFeat1234_LossWGAN-GP_D64_RB16_LRSC0.0001_HRSC0.0001_24May_13h4min-37.index
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_301]]
Traceback (most recent call last):
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/client/session.py", line 1378, in _do_call
return fn(*args)
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) NOT_FOUND: Tensor name "apply_dis_gradients/beta1_power" not found in checkpoint files results/ckpts/SR4RS_E120_B4_LR0.0002_Gan1.0_L1-200.0_VGG3e-05_VGGFeat1234_LossWGAN-GP_D64_RB16_LRSC0.0001_HRSC0.0001_24May_13h4min-37.index
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_301]]
(1) NOT_FOUND: Tensor name "apply_dis_gradients/beta1_power" not found in checkpoint files results/ckpts/SR4RS_E120_B4_LR0.0002_Gan1.0_L1-200.0_VGG3e-05_VGGFeat1234_LossWGAN-GP_D64_RB16_LRSC0.0001_HRSC0.0001_24May_13h4min-37.index
[[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/training/saver.py", line 1418, in restore
sess.run(self.saver_def.restore_op_name,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/client/session.py", line 968, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/client/session.py", line 1191, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/client/session.py", line 1371, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/client/session.py", line 1397, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:

Detected at node 'save/RestoreV2' defined at (most recent call last):
File "/home/d/Travail/super_resolution/code/train.py", line 329, in <module>
tf.compat.v1.app.run(main)
File "/opt/otbtf/lib/python3/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/otbtf/lib/python3/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/d/Travail/super_resolution/code/train.py", line 234, in main
saver = tf.compat.v1.train.Saver(max_to_keep=5)
Node: 'save/RestoreV2'
Detected at node 'save/RestoreV2' defined at (most recent call last):
File "/home/d/Travail/super_resolution/code/train.py", line 329, in <module>
tf.compat.v1.app.run(main)
File "/opt/otbtf/lib/python3/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/otbtf/lib/python3/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/d/Travail/super_resolution/code/train.py", line 234, in main
saver = tf.compat.v1.train.Saver(max_to_keep=5)
Node: 'save/RestoreV2'
2 root error(s) found.
(0) NOT_FOUND: Tensor name "apply_dis_gradients/beta1_power" not found in checkpoint files results/ckpts/SR4RS_E120_B4_LR0.0002_Gan1.0_L1-200.0_VGG3e-05_VGGFeat1234_LossWGAN-GP_D64_RB16_LRSC0.0001_HRSC0.0001_24May_13h4min-37.index
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_301]]
(1) NOT_FOUND: Tensor name "apply_dis_gradients/beta1_power" not found in checkpoint files results/ckpts/SR4RS_E120_B4_LR0.0002_Gan1.0_L1-200.0_VGG3e-05_VGGFeat1234_LossWGAN-GP_D64_RB16_LRSC0.0001_HRSC0.0001_24May_13h4min-37.index
[[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
File "/home/d/Travail/super_resolution/code/train.py", line 329, in <module>
tf.compat.v1.app.run(main)
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/opt/otbtf/lib/python3/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/opt/otbtf/lib/python3/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/d/Travail/super_resolution/code/train.py", line 234, in main
saver = tf.compat.v1.train.Saver(max_to_keep=5)
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/training/saver.py", line 934, in __init__
self.build()
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/training/saver.py", line 946, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/training/saver.py", line 974, in _build
self.saver_def = self._builder._build_internal( # pylint: disable=protected-access
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/training/saver.py", line 543, in _build_internal
restore_op = self._AddRestoreOps(filename_tensor, saveables,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/training/saver.py", line 360, in _AddRestoreOps
all_tensors = self.bulk_restore(filename_tensor, saveables, preferred_shard,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/training/saver.py", line 611, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1516, in restore_v2
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/opt/otbtf/lib/python3/dist-packages/tensorflow/python/framework/ops.py", line 3814, in _create_op_internal
ret = Operation(

Thank you in advance,

Adrien

Hi Adrien,

As stated in the argument parser of train.py, the help text for load_ckpt is: "Path to an existing checkpoint (provide the full path without the .meta)"

I hope that helps!
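
In case it is useful, here is a minimal sketch of how the checkpoint prefix could be derived from any of the files the saver writes. It is an illustration only: the helper name `checkpoint_prefix` and the example path are hypothetical, and it simply assumes the files end in `.index`, `.meta`, or `.data-XXXXX-of-XXXXX`.

```python
import re

# Suffixes that a TF1-style saver appends to the checkpoint prefix (assumed here).
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d+-of-\d+)$", re.IGNORECASE)


def checkpoint_prefix(path: str) -> str:
    """Strip a trailing checkpoint file suffix, leaving the prefix unchanged otherwise."""
    return _CKPT_SUFFIX.sub("", path)


# Hypothetical file name, used only for illustration.
print(checkpoint_prefix("results/ckpts/my_model-37.index"))
```

The resulting prefix (here `results/ckpts/my_model-37`) is the kind of value the help text quoted above describes. If TensorFlow is available, `tf.train.latest_checkpoint(directory)` also returns the prefix of the most recent checkpoint in a directory.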

Hi Remi,

Thanks for your reply. I had tried every combination except the full file name without the extension. I will try that later and keep you posted.

Thank you, good afternoon,

Adrien