nnaisense/MAX

NaN in sampled next states

Trinkle23897 opened this issue · 4 comments

I ran the experiment several times, and almost every run crashed once a NaN was sampled.

The command I ran is python3 main.py with max_explore env_noise_stdev=0.02.

And some of the logs:

➜  max git:(master) python3 main.py with max_explore env_noise_stdev=0.02
15:47:04 | WARNING | warn_if_unobserved | No observers have been added to this run
15:47:04 | INFO | _emit_started | Running command 'main'
15:47:04 | INFO | _emit_started | Started
Configuration (modified, added, typechanged, doc):
  action_noise_stdev = 0             # noise added to actions
  batch_size = 256                   # batch size for training models
  buffer_reuse = True                # transfer the main exploration buffer as off-policy samples to SAC
  checkpoint_frequency = 2000        # dump buffer with normalizer every checkpoint_frequency steps
  d_action = 6                       # dimensionality of action
  d_state = 19                       # dimensionality of state
  data_buffer_size = 20001           # size of the data buffer (FIFO queue)
  device = device(type='cuda')
  disable_cuda = False               # if true: do not ues cuda even though its available
  dump_dir = 'logs/20190708154704_25229'
  ensemble_size = 32                 # number of models in the bootstrap ensemble
  env_name = 'MagellanHalfCheetah-v2'    # environment out of the defined magellan environments with `Magellan` prefix
  env_noise_stdev = 0.02             # standard deviation of noise added to state
  eval_freq = 2000                   # interval in steps for evaluating models on tasks in the environment
  evaluation_model_epochs = 200      # number of training epochs for evaluating the tasks
  exploitation = False
  exploration_mode = 'active'        # active or reactive
  exploring_model_epochs = 50        # number of training epochs in each training phase during exploration
  grad_clip = 5                      # gradient clipping to train model
  learning_rate = 0.001              # learning rate for training models
  max_exploration = True
  model_train_freq = 25              # interval in steps for training models. if `np.inf`, models are trained after every episode
  n_eval_episodes = 3                # number of episodes evaluated for each task
  n_exploration_steps = 20000        # total number of steps (including warm up) of exploration
  n_hidden = 512                     # number of hidden units in each hidden layer (hidden layer size)
  n_layers = 4                       # number of hidden layers in the model (at least 2)
  n_warm_up_steps = 256              # number of steps to populate the initial buffer, actions selected randomly
  non_linearity = 'swish'            # activation function: can be 'leaky_relu' or 'swish'
  normalize_data = True              # normalize states, actions, next states to zero mean and unit variance
  omp_num_threads = 8                # for high CPU count machines
  policy_active_updates = 1          # number of SAC on-policy updates per step in the imagination/environment
  policy_actors = 128                # number of parallel actors in imagination MDP
  policy_batch_size = 4096           # SAC training batch size
  policy_exploit_alpha = 0.4         # entropy scaling factor in SAC for exploitation (task return maximisation)
  policy_exploit_episodes = 250      # number of iterations of SAC before each episode
  policy_exploit_horizon = 100       # length of sampled trajectories (planning horizon)
  policy_explore_alpha = 0.02        # entropy scaling factor in SAC for exploration (utility maximisation)
  policy_explore_episodes = 50       # number of iterations of SAC before each episode
  policy_explore_horizon = 50        # length of sampled trajectories (planning horizon)
  policy_gamma = 0.99                # discount factor for SAC
  policy_lr = 0.001                  # SAC learning rate
  policy_n_hidden = 256              # policy hidden size (2 layers)
  policy_reactive_updates = 100      # number of SAC off-policy updates of `batch_size`
  policy_replay_size = 10000000      # SAC replay size
  policy_tau = 0.005                 # soft target network update mixing factor
  policy_warm_up_episodes = 3        # number of episodes with random actions before SAC on-policy data is collected (as a part of init)
  random_exploration = False
  record = False                     # record videos of episodes (warning: could be slower and use up disk space)
  render = False                     # render the environment visually (warning: could open too many windows)
  renyi_decay = 0.1                  # decay to be used in calculating Renyi entropy
  save_eval_agents = False           # save evaluation agent (sac module objects)
  seed = 649736280                   # the random seed for this experiment
  self_dir = ''
  training_noise_stdev = 0           # standard deviation of training noise applied on states, actions, next states
  use_best_policy = False            # execute the best policy or the last one
  utility_action_norm_penalty = 0    # regularize to actions even when exploring
  utility_measure = 'renyi_div'      # measure for calculating exploration utility of a particular (state, action). 'cp_stdev', 'renyi_div'
  verbosity = 0                      # level of logging/printing on screen
  weight_decay = 0                   # L2 weight decay on model parameters (good: 1e-5, default: 0)
15:47:04 | INFO | do_max_exploration | step: 100	episode complete
15:47:04 | INFO | do_max_exploration | step: 200	episode complete
15:47:09 | INFO | fit_model | step: 256	 training done for 50 epochs, final loss: -1.071
15:47:18 | INFO | fit_model | step: 256	 training done for 200 epochs, final loss: -3.865
/data/git/max/imagination.py:49: UserWarning: NaN in sampled next states!
  warnings.warn("NaN in sampled next states!")
15:53:11 | ERROR | _emit_failed | Failed after 0:06:07!
Traceback (most recent calls WITHOUT Sacred internals):
  File "main.py", line 740, in main
    return do_max_exploration()
  File "main.py", line 645, in do_max_exploration
    average_performance = evaluate_tasks(buffer=buffer, step_num=step_num)
  File "main.py", line 496, in evaluate_tasks
    ep_return, ep_novelty = evaluate_task(env=env, model=model, buffer=buffer, task=task, render=render, filename=filename)
  File "main.py", line 448, in evaluate_task
    action, mdp, agent, _ = act(state=state, agent=agent, mdp=mdp, buffer=buffer, model=model, measure=task.measure, mode='exploit')
  File "main.py", line 385, in act
    ep_return = agent.episode(env=mdp, warm_up=warm_up, verbosity=verbosity, _log=_log)
  File "/data/git/max/sac.py", line 317, in episode
    self.replay.add(states, actions, rewards, next_states)
  File "/data/git/max/sac.py", line 82, in add
    self.masks[i:j] = masks
RuntimeError: The expanded size of the tensor (0) must match the existing size (128) at non-singleton dimension 0.  Target sizes: [0, 1].  Tensor sizes: [128, 1]
➜  max git:(master) python3 main.py with max_explore env_noise_stdev=0.02
15:53:49 | WARNING | warn_if_unobserved | No observers have been added to this run
15:53:49 | INFO | _emit_started | Running command 'main'
15:53:49 | INFO | _emit_started | Started
Configuration (modified, added, typechanged, doc):
  action_noise_stdev = 0             # noise added to actions
  batch_size = 256                   # batch size for training models
  buffer_reuse = True                # transfer the main exploration buffer as off-policy samples to SAC
  checkpoint_frequency = 2000        # dump buffer with normalizer every checkpoint_frequency steps
  d_action = 6                       # dimensionality of action
  d_state = 19                       # dimensionality of state
  data_buffer_size = 20001           # size of the data buffer (FIFO queue)
  device = device(type='cuda')
  disable_cuda = False               # if true: do not ues cuda even though its available
  dump_dir = 'logs/20190708155349_29100'
  ensemble_size = 32                 # number of models in the bootstrap ensemble
  env_name = 'MagellanHalfCheetah-v2'    # environment out of the defined magellan environments with `Magellan` prefix
  env_noise_stdev = 0.02             # standard deviation of noise added to state
  eval_freq = 2000                   # interval in steps for evaluating models on tasks in the environment
  evaluation_model_epochs = 200      # number of training epochs for evaluating the tasks
  exploitation = False
  exploration_mode = 'active'        # active or reactive
  exploring_model_epochs = 50        # number of training epochs in each training phase during exploration
  grad_clip = 5                      # gradient clipping to train model
  learning_rate = 0.001              # learning rate for training models
  max_exploration = True
  model_train_freq = 25              # interval in steps for training models. if `np.inf`, models are trained after every episode
  n_eval_episodes = 3                # number of episodes evaluated for each task
  n_exploration_steps = 20000        # total number of steps (including warm up) of exploration
  n_hidden = 512                     # number of hidden units in each hidden layer (hidden layer size)
  n_layers = 4                       # number of hidden layers in the model (at least 2)
  n_warm_up_steps = 256              # number of steps to populate the initial buffer, actions selected randomly
  non_linearity = 'swish'            # activation function: can be 'leaky_relu' or 'swish'
  normalize_data = True              # normalize states, actions, next states to zero mean and unit variance
  omp_num_threads = 8                # for high CPU count machines
  policy_active_updates = 1          # number of SAC on-policy updates per step in the imagination/environment
  policy_actors = 128                # number of parallel actors in imagination MDP
  policy_batch_size = 4096           # SAC training batch size
  policy_exploit_alpha = 0.4         # entropy scaling factor in SAC for exploitation (task return maximisation)
  policy_exploit_episodes = 250      # number of iterations of SAC before each episode
  policy_exploit_horizon = 100       # length of sampled trajectories (planning horizon)
  policy_explore_alpha = 0.02        # entropy scaling factor in SAC for exploration (utility maximisation)
  policy_explore_episodes = 50       # number of iterations of SAC before each episode
  policy_explore_horizon = 50        # length of sampled trajectories (planning horizon)
  policy_gamma = 0.99                # discount factor for SAC
  policy_lr = 0.001                  # SAC learning rate
  policy_n_hidden = 256              # policy hidden size (2 layers)
  policy_reactive_updates = 100      # number of SAC off-policy updates of `batch_size`
  policy_replay_size = 10000000      # SAC replay size
  policy_tau = 0.005                 # soft target network update mixing factor
  policy_warm_up_episodes = 3        # number of episodes with random actions before SAC on-policy data is collected (as a part of init)
  random_exploration = False
  record = False                     # record videos of episodes (warning: could be slower and use up disk space)
  render = False                     # render the environment visually (warning: could open too many windows)
  renyi_decay = 0.1                  # decay to be used in calculating Renyi entropy
  save_eval_agents = False           # save evaluation agent (sac module objects)
  seed = 644769808                   # the random seed for this experiment
  self_dir = ''
  training_noise_stdev = 0           # standard deviation of training noise applied on states, actions, next states
  use_best_policy = False            # execute the best policy or the last one
  utility_action_norm_penalty = 0    # regularize to actions even when exploring
  utility_measure = 'renyi_div'      # measure for calculating exploration utility of a particular (state, action). 'cp_stdev', 'renyi_div'
  verbosity = 0                      # level of logging/printing on screen
  weight_decay = 0                   # L2 weight decay on model parameters (good: 1e-5, default: 0)
15:53:49 | INFO | do_max_exploration | step: 100	episode complete
15:53:49 | INFO | do_max_exploration | step: 200	episode complete
15:53:54 | INFO | fit_model | step: 256	 training done for 50 epochs, final loss: -1.09
15:54:03 | INFO | fit_model | step: 256	 training done for 200 epochs, final loss: -3.877
16:02:38 | INFO | evaluate_tasks | task: running	episode: 1	reward: -57.8515
/data/git/max/imagination.py:49: UserWarning: NaN in sampled next states!
  warnings.warn("NaN in sampled next states!")
16:05:04 | ERROR | _emit_failed | Failed after 0:11:16!
Traceback (most recent calls WITHOUT Sacred internals):
  File "main.py", line 740, in main
    return do_max_exploration()
  File "main.py", line 645, in do_max_exploration
    average_performance = evaluate_tasks(buffer=buffer, step_num=step_num)
  File "main.py", line 496, in evaluate_tasks
    ep_return, ep_novelty = evaluate_task(env=env, model=model, buffer=buffer, task=task, render=render, filename=filename)
  File "main.py", line 448, in evaluate_task
    action, mdp, agent, _ = act(state=state, agent=agent, mdp=mdp, buffer=buffer, model=model, measure=task.measure, mode='exploit')
  File "main.py", line 385, in act
    ep_return = agent.episode(env=mdp, warm_up=warm_up, verbosity=verbosity, _log=_log)
  File "/data/git/max/sac.py", line 317, in episode
    self.replay.add(states, actions, rewards, next_states)
  File "/data/git/max/sac.py", line 82, in add
    self.masks[i:j] = masks
RuntimeError: The expanded size of the tensor (0) must match the existing size (128) at non-singleton dimension 0.  Target sizes: [0, 1].  Tensor sizes: [128, 1]

I don't know what's going wrong. Could you please help me?

neale commented

I'm having the same issue. I haven't fixed it yet.

  • Mujoco 150
  • PyTorch 1.1.0
  • standard half cheetah experiment

I attached torch.autograd.detect_anomaly() to the main training loop, and it detected a NaN error in the value prediction network of SAC here.
Since you later mask the NaNs and Infs, it's hard to say whether that's related. With gradient clipping in place, it seems unlikely that the forward models would completely and suddenly diverge or collapse. It seems more likely that there is an issue with the NaN/Inf masking, and that we're predicting the next state from a NaN/Inf state.
What confuses me is that there is a check and a warning for this in imagination.py, but no way to safely handle it.
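A minimal sketch of one way such a case could be handled safely, assuming the goal is simply to drop every transition that contains a NaN or Inf and to do nothing when no valid transition remains. `TinyReplay` and `finite_rows` are hypothetical names written for illustration with NumPy; they are not the repository's replay class or its `danger_mask` helper, and the real buffer stores more fields than shown here.

```python
import numpy as np

def finite_rows(*arrays):
    """Boolean mask of rows that are finite in every given array (hypothetical helper)."""
    mask = np.ones(arrays[0].shape[0], dtype=bool)
    for a in arrays:
        # Flatten trailing dimensions so each row is checked as a whole.
        mask &= np.isfinite(a).reshape(a.shape[0], -1).all(axis=1)
    return mask

class TinyReplay:
    """Illustrative ring buffer; an assumption, not the code in sac.py."""

    def __init__(self, size, d_state):
        self.size = size
        self.states = np.zeros((size, d_state))
        self.masks = np.zeros((size, 1))
        self.ptr = 0

    def add(self, states, masks):
        keep = finite_rows(states, masks)
        # Filter source rows with the same mask that determines the count,
        # so source and destination always have the same number of rows.
        states, masks = states[keep], masks[keep]
        n = states.shape[0]
        if n == 0:
            return 0  # nothing valid to store: skip instead of crashing
        if n > self.size:
            # Keep only the most recent rows if the batch exceeds capacity.
            states, masks, n = states[-self.size:], masks[-self.size:], self.size
        idx = (self.ptr + np.arange(n)) % self.size  # wrap around the end
        self.states[idx] = states
        self.masks[idx] = masks
        self.ptr = (self.ptr + n) % self.size
        return n
```

With this shape, a batch in which every row is non-finite leaves the buffer untouched, and the caller can use the returned count to decide whether to reset the imagined rollout.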

I have experienced this issue only once before, and after investigating I convinced myself that it was due to "bad luck" in the warm-up data, so I am surprised to hear that almost every run crashes for you. The underlying problem is that the policy exploits the models "dreaming" absurdly high rewards, which presumably leads to huge V/Q-values and, eventually, to infs/NaNs. With verbosity=3, you can see something like this:

13:09:46 | INFO | episode | step_reward. mean: 24353.95 +- 150840.66 [-2.67, 1368097.62]         
13:09:46 | INFO | episode | step_reward. mean: 28201.39 +- 190058.91 [-2.59, 1855795.25]         
13:09:46 | INFO | episode | step_reward. mean: 35032.18 +- 254233.11 [-3.30, 2684573.50]
13:09:46 | INFO | episode | step_reward. mean: 23426.37 +- 140640.00 [-3.38, 1202750.00]
13:09:46 | INFO | episode | step_reward. mean: 33987.25 +- 220136.44 [-3.05, 2216282.00]
13:09:47 | INFO | episode | step_reward. mean: 51998.88 +- 368336.62 [-2.80, 3857665.00]
13:09:47 | INFO | episode | step_reward. mean: 84999.31 +- 573147.12 [-2.47, 5745555.00]
13:09:47 | INFO | episode | step_reward. mean: 121692.67 +- 871324.50 [-2.70, 8948482.00]
13:09:47 | INFO | episode | step_reward. mean: 125470.11 +- 982328.94 [-127924.93, 10083945.00]
13:09:48 | INFO | episode | step_reward. mean: 187612.44 +- 1410389.38 [-2.86, 14323131.00]
13:09:48 | INFO | episode | step_reward. mean: 287003.62 +- 1920799.75 [-2.76, 15664406.00]
13:09:48 | INFO | episode | step_reward. mean: 278497.91 +- 1884108.25 [-2.89, 15320821.00]
13:09:48 | INFO | episode | step_reward. mean: 334041.50 +- 2474476.00 [-2.59, 26274620.00]

I am sure that tuning the hyperparameters should prevent that from happening. I think increasing the warm-up to 1024 samples or increasing the policy_alpha slightly should already help.

To be precise, the code in this repo is not 100% the same as the code that was used to run the experiments for the paper. This one contains one bug fix. In the original code we applied one spurious tanh(action), effectively shrinking the [-1,1] action space of the environment. (This affected all the algorithms, so it did not matter for the relative comparison between them described in the paper; that is why we did not need to redo the experiments and did not hit the NaN issue.) As a side effect, however, this bug apparently also limited how far the policy could exploit the model.
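As a small illustration of how an extra tanh shrinks the action range, the NumPy sketch below compares clipping with tanh on actions that already lie in [-1, 1]. It is not code from the repository, and the assumption that the incoming actions are already inside [-1, 1] is mine.

```python
import numpy as np

# Actions that already lie inside the environment's [-1, 1] range.
actions = np.linspace(-1.0, 1.0, 5)   # [-1, -0.5, 0, 0.5, 1]

clipped = np.clip(actions, -1.0, 1.0)  # leaves in-range actions unchanged
squashed = np.tanh(actions)            # compresses them towards zero

print("clip:", clipped)
print("tanh:", np.round(squashed, 4))
print("largest |action| after tanh:", round(float(np.abs(squashed).max()), 4))
```

Since tanh(1) is roughly 0.76, an in-range action can no longer reach the limits of the action space after the extra tanh, whereas clipping passes it through unchanged.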

If you wish to exactly reproduce the results in the paper, here is the full diff to apply:

diff repo_max/imagination.py paper_max/imagination.py
42c42
<             next_state_means, next_state_vars = self.model.forward_all(self.states, actions)    # shape: (n_actors, ensemble_size, d_state)
---
>             next_state_means, next_state_vars = self.model.forward_all(self.states, actions)    # shape: both (ensemble_size, n_actors, d_action)
diff repo_max/main.py paper_max/main.py
130c130
<     use_best_policy = False                         # execute the best policy or the last one
---
>     use_best_policy = False                         # transfer the main exploration buffer as off-policy samples to SAC
251d250
<         loss.backward()
252a252
>         loss.backward()
269,270d268
<         if verbosity >= 2:
<             _log.info(f'epoch: {epoch_i:3d} training_loss: {tr_loss:.2f}')
376c374
<         # to be fair to reactive methods, clear real env data in SAC buffer, to prevent further gradient updates from it.
---
>         # to be fair to reactive methods, clear real env data in buffer, to prevent further gradient updates from it
385c383
<             ep_return = agent.episode(env=mdp, warm_up=warm_up, verbosity=verbosity, _log=_log)
---
>             ep_return = agent.episode(env=mdp, warm_up=warm_up)
419,420c417
< @ex.capture
< def transition_novelty(state, action, next_state, model, renyi_decay):
---
> def transition_novelty(state, action, next_state, model):
427c424
<     measure = JensenRenyiDivergenceUtilityMeasure(decay=renyi_decay)
---
>     measure = JensenRenyiDivergenceUtilityMeasure(decay=0.1)
433c430
< def evaluate_task(env, model, buffer, task, render, filename, record, save_eval_agents, verbosity, _run, _log):
---
> def evaluate_task(env, model, buffer, task, render, filename, record, save_eval_agents, _run):
451,452c448
<         n = transition_novelty(state, action, next_state, model=model)
<         novelty.append(n)
---
>         novelty.append(transition_novelty(state, action, next_state, model=model))
455,456d450
<         if verbosity >= 3:
<             _log.info(f'reward: {reward:5.2f} trans_novelty: {n:5.2f} action: {action}')
479,486d472
<     # Uncomment for exploration coverage in ant
<     #from envs.ant import rate_buffer
<     #coverage = rate_buffer(buffer=buffer)
<     #_run.log_scalar("coverage", coverage, step_num)
<     #_run.result = coverage
<     #_log.info(f"coverage: {coverage}")
<     #return coverage
< 
688,689d673
<     checkpoint(buffer=buffer, step_num=n_exploration_steps)
< 
698a683
> 
diff repo_max/models.py paper_max/models.py
129a130,131
>         actions = torch.tanh(actions)
> 
Only in repo_max: readme.md
diff repo_max/sac.py paper_max/sac.py
32d31
<         self.ptr = 0
33a33
>         self.ptr = 0
39d38
<         self.buffer_full = False
51d49
<         self.buffer_full = False
64,67d61
<         
<         # skip ones with NaNs and Infs
<         skip_mask = danger_mask(states) + danger_mask(actions) + danger_mask(rewards) + danger_mask(next_states)
<         include_mask = (skip_mask == 0)
69d62
<         n_samples = torch.sum(include_mask).item()
73d65
<             self.buffer_full = True
76c68,72
<         j = self.ptr + n_samples
---
> 
>         # skip ones with NaNs and Infs
>         skip_mask = danger_mask(states) + danger_mask(actions) + danger_mask(rewards) + danger_mask(next_states)
>         include_mask = (skip_mask == 0)
>         j = self.ptr + torch.sum(include_mask).item()
87c83
<         idxs = np.random.randint(len(self), size=batch_size)
---
>         idxs = np.random.randint(self.ptr, size=batch_size)
94,98d89
<     def __len__(self):
<         if self.buffer_full:
<             return self.size
<         return self.ptr
< 
303c294
<     def episode(self, env, warm_up=False, train=True, verbosity=0, _log=None):
---
>     def episode(self, env, warm_up=False, train=True):
318,320d308
<             if verbosity >= 3 and _log is not None:
<                 _log.info(f'step_reward. mean: {torch.mean(rewards).item():5.2f} +- {torch.std(rewards).item():.2f} [{torch.min(rewards).item():5.2f}, {torch.max(rewards).item():5.2f}]')
< 
diff repo_max/wrappers.py paper_max/wrappers.py
13c13
<         action = np.clip(action, -1., 1.)
---
>         action = np.tanh(action)

Thanks! Actually, my subsequent runs did not collapse. It just takes a few more tries.

The code in this repo can still crash, though. After setting verbosity=3, the reward does not seem to converge. Could you please push the final version to the master branch?

11:01:07 | INFO | episode | step_reward. mean: 86085883396096.00 +- 829155918741504.00 [-298888250523648.00, 9337457543741440.00]
11:01:07 | INFO | episode | step_reward. mean: -21670574161920.00 +- 438323323600896.00 [-4848044193349632.00, 699136185729024.00]
11:01:07 | INFO | episode | step_reward. mean: -248891307982848.00 +- 3394395332149248.00 [-38263421258432512.00, 1774258237734912.00]
11:01:07 | INFO | episode | step_reward. mean: -143906335358976.00 +- 2610979338715136.00 [-28923108635181056.00, 4401228008128512.00]
11:01:07 | INFO | episode | step_reward. mean: 1921059548823552.00 +- 20858781403447296.00 [-1080617462661120.00, 235933342527127552.00]
11:01:07 | INFO | episode | step_reward. mean: 2565822589435904.00 +- 26671938034204672.00 [-2312866195570688.00, 301287592127627264.00]
11:01:08 | INFO | act | 	ep: 63	average step return: 41819492609799.95
/data/git/max/imagination.py:49: UserWarning: NaN in sampled next states!
  warnings.warn("NaN in sampled next states!")
11:01:08 | ERROR | _emit_failed | Failed after 0:02:29!
Traceback (most recent calls WITHOUT Sacred internals):
  File "main.py", line 740, in main
    return do_max_exploration()
  File "main.py", line 645, in do_max_exploration
    average_performance = evaluate_tasks(buffer=buffer, step_num=step_num)
  File "main.py", line 496, in evaluate_tasks
    ep_return, ep_novelty = evaluate_task(env=env, model=model, buffer=buffer, task=task, render=render, filename=filename)
  File "main.py", line 448, in evaluate_task
    action, mdp, agent, _ = act(state=state, agent=agent, mdp=mdp, buffer=buffer, model=model, measure=task.measure, mode='exploit')
  File "main.py", line 385, in act
    ep_return = agent.episode(env=mdp, warm_up=warm_up, verbosity=verbosity, _log=_log)
  File "/data/git/max/sac.py", line 317, in episode
    self.replay.add(states, actions, rewards, next_states)
  File "/data/git/max/sac.py", line 82, in add
    self.masks[i:j] = masks
RuntimeError: The expanded size of the tensor (0) must match the existing size (128) at non-singleton dimension 0.  Target sizes: [0, 1].  Tensor sizes: [128, 1]

To see whether this is really happening, I executed 4 runs with the current master with more warmup steps for higher stability. I used the following command:

main.py with max_explore env_noise_stdev=0.02 n_warm_up_steps=1024

and found no NaN problems.

If anybody still has problems, please use the current branch and provide the config seeds used.