younggyoseo/MWM

This error is raised when running DMControl experiments (using demonstrations)

Lyn-Qiu opened this issue · 4 comments

/home/lc/.local/lib/python3.8/site-packages/keras/mixed_precision/autocast_variable.py:244 in assign

    241           return update_var
    242
    243   def assign(self, value, use_locking=None, name=None, read_value=True):
  ❱ 244     return self._apply_assign_update(self.variable.assign, value, use_locking,
    245                                      name, read_value)
    246
    247   def assign_add(self, delta, use_locking=None, name=None, read_value=True):

/home/lc/.local/lib/python3.8/site-packages/keras/mixed_precision/autocast_variable.py:217 in _apply_assign_update

    214     # DistributedVariable.assign returns a DistributedVariable.
    215     # MirroredStrategy, it returns a Mirrored value.
    216     if tf.compat.v1.executing_eagerly_outside_functions():
  ❱ 217       assign_op = update_fn(value, use_locking, name, False)
    218       if read_value:
    219         # We create a new AutoCastVariable with the same underlying variable.
    220         # The new AutoCastVariable is identical except the 'op' attribute.

/home/lc/.local/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py:899 in assign

    896         tensor_name = ""
    897       else:
    898         tensor_name = " " + str(self.name)
  ❱ 899         raise ValueError(
    900             ("Cannot assign to variable%s due to variable shape %s and value "
    901              "shape %s are incompatible") %
    902             (tensor_name, self._shape, value_tensor.shape))
ValueError: Cannot assign to variable dense_12/kernel:0 due to variable shape (512, 9) and value shape (512, 4) are incompatible

Hi,

  1. Could you let me know the exact script you used?
  2. Does the script work with a new log directory? For instance, when you use --logdir logs, the logs directory should be empty.
  3. What do you mean by 'use demonstration'? Do the demonstration and the task you are planning to use have the same action spaces? (A quick way to check is sketched below.)
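For the third point, here is a minimal sketch for checking whether previously stored trajectories match the new task's action space. It assumes a DreamerV2-style episode format, i.e. .npz files with an 'action' array saved under the log directory; the exact path and keys may differ in your setup:

import pathlib
import numpy as np

# Inspect the action dimensionality of previously collected episodes.
# Assumption: episodes are saved as .npz files containing an 'action' array,
# e.g. under <logdir>/train_episodes (DreamerV2-style layout).
episode_dir = pathlib.Path("logs/train_episodes")
for ep_file in sorted(episode_dir.glob("*.npz"))[:3]:
    with np.load(ep_file) as ep:
        print(ep_file.name, "action shape:", ep["action"].shape)

If the last dimension printed here (e.g. 4 for Meta-World) differs from the action dimension of the task you are launching (e.g. 9 for a DMC manipulation task), the stored data is incompatible with the new experiment.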

Thanks for the reply,

  1. I used "TF_XLA_FLAGS=--tf_xla_auto_jit=2 python mwm/train.py --logdir logs --configs dmc_vision --task dmc_manip_reach_duplo --steps 252000 --mae.reward_pred True --mae.early_conv True" as in the README, and I could run the Meta-World script successfully.
  2. Yes, the logs directory is not empty.
  3. 'Use demonstration' means I use "TF_XLA_FLAGS=--tf_xla_auto_jit=2 python mwm/train.py --logdir logs --configs dmc_vision --task dmc_manip_reach_duplo --steps 252000 --mae.reward_pred True --mae.early_conv True"

And now I seem to have solved this problem by using --logdir logs3. I previously ran the Meta-World code with --logdir logs; I don't know if that is the problem?

But it seems that the code is stuck here:
[9250] train_return 15.08 / train_length 125 / train_total_steps 4625 / train_total_episodes 37 / train_loaded_steps 4625 / train_loaded_episodes 37
Train episode has 125 steps and return 0.0.
[9500] train_return 1.8e-18 / train_length 125 / train_total_steps 4750 / train_total_episodes 38 / train_loaded_steps 4750 / train_loaded_episodes 38
Train episode has 125 steps and return 0.0.
[9750] train_return 1.2e-12 / train_length 125 / train_total_steps 4875 / train_total_episodes 39 / train_loaded_steps 4875 / train_loaded_episodes 39
Train episode has 125 steps and return 0.0.
[10000] train_return 2.4e-14 / train_length 125 / train_total_steps 5000 / train_total_episodes 40 / train_loaded_steps 5000 / train_loaded_episodes 40
Eval episode has 125 steps and return 0.0.
[10000] eval_return 5.4e-13 / eval_length 125 / eval_total_steps 0 / eval_total_episodes 0 / eval_loaded_steps 0 / eval_loaded_episodes 0
Create agent.
Found 5843809 mae parameters.
Found 14690433 model parameters.
Found 1846290 actor parameters.
Found 1837569 critic parameters.

Explanation of the bug
I see, the reason for your bug is that the code automatically loads the previous trajectories in the logs directory, so for new experiments you should use a different log directory. That is why using --logdir logs3 solved the problem.
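To illustrate the error class only (this is not the MWM code itself): a weight built for a 9-dimensional action head cannot be assigned values shaped for a 4-dimensional one, which is exactly what the traceback above reports.

import tensorflow as tf

# Minimal illustration of the same ValueError: a dense kernel built for a
# 9-dimensional action space (e.g. a DMC manipulation task) cannot be assigned
# weights shaped for a 4-dimensional one (e.g. leftover Meta-World data in the
# old log directory).
kernel = tf.Variable(tf.zeros((512, 9)), name="dense_12/kernel")
kernel.assign(tf.zeros((512, 4)))  # ValueError: shapes (512, 9) and (512, 4) are incompatible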

Reason for being stuck
Actually, it's not stuck; it's natural for the code to take several minutes before it starts running due to XLA optimization. For debugging, to check whether the code runs on your machine at all, you can run "python mwm/train.py --logdir logs --configs dmc_vision --task dmc_manip_reach_duplo --steps 252000 --mae.reward_pred True --mae.early_conv True --dataset.batch 8 --mae_dataset.batch 8", i.e. without XLA optimization and with small batch sizes (so memory does not overflow without XLA optimization).

Got it, thanks a lot~