IndexError during training
rohjunha opened this issue · 5 comments
I ran the following command under Slurm:
srun --gres=gpu:4 --mem=128GB -c 16 python habitat_baselines/run.py --exp-config habitat_baselines/config/multinav/ppo_multinav.yaml --agent-type oracle-ego --run-type train
and got the IndexError below:
...
I0815 22:20:39.351375 6788 simulator.py:145] Loaded navmesh data/scene_datasets/mp3d/B6ByNegPMKs/B6ByNegPMKs.navmesh [14/4757]
2021-08-15 22:20:39,364 Initializing task MultiNav-v1
2021-08-15 22:20:51,003 agent number of parameters: 20248389
/home/mai/jroh/miniconda3/envs/mon2/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later,
you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/home/mai/jroh/projects/multion/habitat_baselines/rl/models/rnn_state_encoder.py:105: UserWarning: This overload of nonzero is deprecated:
nonzero()
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /opt/conda/conda-bld/pytorch_1607370117127/work/torch/csrc/utils/python_arg_parser.cpp:882.)
has_zeros = (masks[1:] == 0.0).any(dim=-1).nonzero().squeeze().cpu()
Traceback (most recent call last):
File "habitat_baselines/run.py", line 91, in <module>
main()
File "habitat_baselines/run.py", line 43, in main
run_exp(**vars(args))
File "habitat_baselines/run.py", line 86, in run_exp
trainer.train()
File "/home/mai/jroh/projects/multion/habitat_baselines/rl/ppo/ppo_trainer.py", line 1231, in train
) = self._update_agent(ppo_cfg, rollouts)
File "/home/mai/jroh/projects/multion/habitat_baselines/rl/ppo/ppo_trainer.py", line 1124, in _update_agent
value_loss, action_loss, dist_entropy = self.agent.update(rollouts)
File "/home/mai/jroh/projects/multion/habitat_baselines/rl/ppo/ppo.py", line 232, in update
for sample in data_generator:
File "/home/mai/jroh/projects/multion/habitat_baselines/common/rollout_storage.py", line 422, in recurrent_generator
ind = perm[start_ind + offset]
IndexError: index 18 is out of bounds for dimension 0 with size 18
I am using Python 3.8.11, PyTorch 1.7.1, and cudatoolkit 10.1.
Sorry to ask, but did you solve this problem? I would like to know the solution, too.
Can you please try changing NUM_PROCESSES from 18 to 16 here? Let us know if that works.
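For reference, the change would look something like this in the experiment config (a sketch; the exact layout of ppo_multinav.yaml is assumed, based only on the NUM_PROCESSES name mentioned above):

```yaml
# habitat_baselines/config/multinav/ppo_multinav.yaml (excerpt, assumed layout)
NUM_PROCESSES: 16  # was 18; should be divisible by the mini-batch count
```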
sonia-raychaudhuri, it works well, thank you! By the way, I had left a question on the "Embodied AI Workshop" Slack channel and you replied there; thank you for that reference as well.
num_processes and num_mini_batch define how many environments to keep in each mini-batch here. If num_processes is not a multiple of num_mini_batch, this will give an error. I believe your num_mini_batch is set to 4, so setting num_processes to 16 or 20, as opposed to 18, will do the trick.
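The failure mode can be sketched in isolation (names here are illustrative, not the actual rollout_storage.py code): with integer division, 18 // 4 = 4 environments per batch, so the generator walks a permutation of 18 indices in 5 windows of 4, and the last window reads past the end of the permutation, which matches the "index 18 is out of bounds for dimension 0 with size 18" error in the traceback.

```python
import random

def recurrent_minibatches(num_processes, num_mini_batch):
    """Simplified sketch of the indexing scheme in a recurrent
    mini-batch generator (illustrative, not the habitat code)."""
    envs_per_batch = num_processes // num_mini_batch
    perm = random.sample(range(num_processes), num_processes)
    for start_ind in range(0, num_processes, envs_per_batch):
        # Mirrors `ind = perm[start_ind + offset]` from the traceback;
        # raises IndexError when num_processes % num_mini_batch != 0.
        yield [perm[start_ind + offset] for offset in range(envs_per_batch)]
```

With num_processes=16 and num_mini_batch=4 this yields exactly 4 batches of 4 environments; with 18 and 4 it raises an IndexError on the final window.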