saimwani/multiON

IndexError during training

rohjunha opened this issue · 5 comments

I ran the following command

srun --gres=gpu:4 --mem=128GB -c 16 python habitat_baselines/run.py --exp-config habitat_baselines/config/multinav/ppo_multinav.yaml --agent-type oracle-ego --run-type train

with Slurm, and got the IndexError below:

...
I0815 22:20:39.351375 6788 simulator.py:145] Loaded navmesh data/scene_datasets/mp3d/B6ByNegPMKs/B6ByNegPMKs.navmesh
2021-08-15 22:20:39,364 Initializing task MultiNav-v1
2021-08-15 22:20:51,003 agent number of parameters: 20248389
/home/mai/jroh/miniconda3/envs/mon2/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/home/mai/jroh/projects/multion/habitat_baselines/rl/models/rnn_state_encoder.py:105: UserWarning: This overload of nonzero is deprecated:
        nonzero()
Consider using one of the following signatures instead:
        nonzero(*, bool as_tuple) (Triggered internally at  /opt/conda/conda-bld/pytorch_1607370117127/work/torch/csrc/utils/python_arg_parser.cpp:882.)
  has_zeros = (masks[1:] == 0.0).any(dim=-1).nonzero().squeeze().cpu()
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 91, in <module>
    main()
  File "habitat_baselines/run.py", line 43, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 86, in run_exp
    trainer.train()
  File "/home/mai/jroh/projects/multion/habitat_baselines/rl/ppo/ppo_trainer.py", line 1231, in train
    ) = self._update_agent(ppo_cfg, rollouts)
  File "/home/mai/jroh/projects/multion/habitat_baselines/rl/ppo/ppo_trainer.py", line 1124, in _update_agent
    value_loss, action_loss, dist_entropy = self.agent.update(rollouts)
  File "/home/mai/jroh/projects/multion/habitat_baselines/rl/ppo/ppo.py", line 232, in update
    for sample in data_generator:
  File "/home/mai/jroh/projects/multion/habitat_baselines/common/rollout_storage.py", line 422, in recurrent_generator
    ind = perm[start_ind + offset]
IndexError: index 18 is out of bounds for dimension 0 with size 18

I am using Python 3.8.11, PyTorch 1.7.1, and cudatoolkit 10.1.

Sorry to jump in, but did you manage to solve this problem?

I would like to know the solution, too.

Can you please try changing NUM_PROCESSES from 18 to 16 here? Let us know if that works.

sonia-raychaudhuri, it works! Thank you. By the way, I had left a question on the "Embodied AI workshop" Slack channel, and you replied there as well; thank you for that reference, too.

num_processes and num_mini_batch determine how many environments go into each mini-batch here. If num_processes is not a multiple of num_mini_batch, the recurrent generator raises an IndexError like the one above. I believe your num_mini_batch is set to 4, so setting num_processes to 16 or 20, as opposed to 18, will do the trick.
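The divisibility requirement can be illustrated with a small sketch that mimics the indexing pattern in recurrent_generator (simplified for illustration; mini_batch_starts is a hypothetical helper, not the actual multiON code):

```python
import torch

def mini_batch_starts(num_processes: int, num_mini_batch: int):
    # Integer division: with 18 processes and 4 mini-batches this is 4,
    # silently dropping the remainder of 2.
    num_envs_per_batch = num_processes // num_mini_batch
    perm = torch.randperm(num_processes)  # permutation of env indices
    batches = []
    # range(0, 18, 4) yields starts 0, 4, 8, 12, 16 -- a fifth, partial batch.
    for start_ind in range(0, num_processes, num_envs_per_batch):
        batch = [perm[start_ind + offset].item()
                 for offset in range(num_envs_per_batch)]
        batches.append(batch)
    return batches

# 16 processes / 4 mini-batches divides evenly: four batches of four envs.
print(len(mini_batch_starts(16, 4)))  # 4

# 18 processes / 4 mini-batches: the last start index is 16, so
# perm[16 + 2] asks for index 18 in a tensor of size 18 -> IndexError.
try:
    mini_batch_starts(18, 4)
except IndexError as e:
    print("IndexError:", e)
```

This is why 16 or 20 work while 18 does not: the fix is simply to keep num_processes an exact multiple of num_mini_batch.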