Trouble loading checkpoints

Question

Opened this issue 7 years ago · 1 comment

I need to be able to resume training Breakout-v0 after stopping it. I would also like to be able to move a checkpoint directory to another machine and resume training there.

When I train on my laptop, which runs Ubuntu 14.04, I am able to resume after stopping. But on the faster machine I really want to use, I cannot resume after stopping. That machine runs Ubuntu 16.04, FWIW.

Both machines use TensorFlow 1.3.0. The working laptop uses Python 3.6 and the non-working machine uses Python 3.5.2. OpenAI Gym is version 0.9.4 on both machines, as installed by pip. Neither machine uses a GPU, and both use NHWC.

On both machines, I have cloned the devsisters/DQN-tensorflow repository and manually fixed the bugs that prevent it from working with Python 3.x.

```
~/DQN-tensorflow$ python main.py --env_name=Breakout-v0 --is_train=True --display=False

[*] GPU : 1.0000
{'_save_step': 500000,
'_test_step': 50000,
'action_repeat': 4,
'backend': 'tf',
'batch_size': 32,
'cnn_format': 'NHWC',
'discount': 0.99,
'display': False,
'double_q': False,
'dueling': False,
'env_name': 'Breakout-v0',
'env_type': 'detail',
'ep_end': 0.1,
'ep_end_t': 1000000,
'ep_start': 1.0,
'history_length': 4,
'learn_start': 50000.0,
'learning_rate': 0.00025,
'learning_rate_decay': 0.96,
'learning_rate_decay_step': 50000,
'learning_rate_minimum': 0.00025,
'max_delta': 1,
'max_reward': 1.0,
'max_step': 50000000,
'memory_size': 1000000,
'min_delta': -1,
'min_reward': -1.0,
'model': 'm1',
'random_start': 30,
'scale': 10000,
'screen_height': 84,
'screen_width': 84,
'target_q_update_step': 10000,
'train_frequency': 4}
WARNING:tensorflow:From /home/mjc/DQN-tensorflow/dqn/agent.py:224: calling argmax (from tensorflow.python.ops.math_ops) with dimension is deprecated and will be removed in a future version.
Instructions for updating:
Use the axis argument instead
WARNING:tensorflow:From /opt/anaconda/miniconda3/envs/tfbuild/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py:107: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.

[*] Loading checkpoints...
[!] Load FAILED: checkpoints/Breakout-v0/backend-tf/ep_end-0.1/model-m1/screen_width-84/env_type-detail/learning_rate-0.00025/learning_rate_minimum-0.00025/memory_size-1000000/env_name-Breakout-v0/dueling-False/learning_rate_decay-0.96/batch_size-32/min_delta--1/max_reward-1.0/learn_start-50000.0/double_q-False/max_delta-1/scale-10000/random_start-30/cnn_format-NHWC/discount-0.99/min_reward--1.0/action_repeat-4/learning_rate_decay_step-50000/ep_start-1.0/history_length-4/target_q_update_step-10000/ep_end_t-1000000/train_frequency-4/max_step-50000000/screen_height-84/
```

How can this problem be fixed?
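
One thing worth checking, offered as a guess rather than a confirmed diagnosis: the directory in the `Load FAILED` line looks like it is assembled from the config keys, and its segments are not in the sorted order of the printed config. On Python 3.5 a plain `dict` does not keep a stable iteration order for string keys from one interpreter run to the next (string hashing is randomized per process unless `PYTHONHASHSEED` is fixed), whereas CPython 3.6 keeps insertion order. If the path is built by iterating a dict, each run on 3.5 could look in a differently ordered directory. The sketch below builds the path from sorted keys instead. `stable_model_dir` and its parameters are hypothetical names, not part of the repository, and skipping `_`-prefixed keys and `display` is only inferred from the path in the log.

```python
def stable_model_dir(config, env_name, root="checkpoints"):
    """Build a checkpoint directory path with segments in sorted key order.

    Hypothetical helper: `config` is assumed to be a plain dict of settings
    like the one printed at startup.
    """
    parts = [root, env_name]
    for key in sorted(config):
        # Assumption inferred from the log: private keys and 'display'
        # do not appear in the checkpoint path.
        if key.startswith("_") or key == "display":
            continue
        value = config[key]
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        parts.append("%s-%s" % (key, value))
    return "/".join(parts) + "/"


if __name__ == "__main__":
    example = {"model": "m1", "batch_size": 32, "_save_step": 500000, "display": False}
    print(stable_model_dir(example, "Breakout-v0"))
```

If this turns out to be the cause, checkpoints saved under an older, differently ordered path would still need to be moved to the directory the new ordering produces before they can be found.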

Answer 1 · 2018-02-27T10:44:46.000Z

I have the same problem.
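
A quick way to see what is actually on disk, as a minimal sketch that uses only the standard library: TensorFlow's `Saver` writes a state file named `checkpoint` into each directory it saves to, so walking the `checkpoints/` tree and listing every directory that contains such a file shows whether several differently named directories have been created across runs. `find_checkpoint_dirs` is a hypothetical helper name, and the default `checkpoints` root is taken from the path in the log above.

```python
import os


def find_checkpoint_dirs(root="checkpoints"):
    """Return every directory under `root` that holds a 'checkpoint' state file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "checkpoint" in filenames:
            found.append(dirpath)
    return sorted(found)


if __name__ == "__main__":
    for path in find_checkpoint_dirs():
        print(path)
```

If more than one directory shows up for the same settings, comparing their names against the path in the `Load FAILED` message should show whether the loader is looking in a different place from where the checkpoint was saved.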