Multi-GPU with Central Value not working

Question

When I try to run multi-GPU training with Horovod, I get the following error on both ranks:

[1,1]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
[1,1]<stderr>:  LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
[1,1]<stderr>:/opt/conda/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
[1,1]<stderr>:  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
[1,1]<stderr>:/workspace/isaacgymenvs/isaacgymenvs/tasks/allegro_hand.py:275: DeprecationWarning: an integer is required (got type isaacgym._bindings.linux-x86_64.gym_38.DofDriveMode).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
[1,1]<stderr>:  asset_options.default_dof_drive_mode = gymapi.DOF_MODE_POS
[1,1]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/common/util.py:227: DeprecationWarning: Parameter `average` has been replaced with `op` and will be removed in v0.21.0
[1,1]<stderr>:  warnings.warn('Parameter `average` has been replaced with `op` and will be removed in v0.21.0',
[1,1]<stderr>:Error executing job with overrides: ['task=AllegroHandLSTM', 'headless=True', 'multi_gpu=True', 'train.params.config.mixed_precision=False']
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "train.py", line 137, in launch_rlg_hydra
[1,1]<stderr>:    runner.run({
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 97, in run
[1,1]<stderr>:    self.run_train(args)
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 78, in run_train
[1,1]<stderr>:    agent.train()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1141, in train
[1,1]<stderr>:    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1012, in train_epoch
[1,1]<stderr>:    self.train_central_value()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 516, in train_central_value
[1,1]<stderr>:    return self.central_value_net.train_net()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 194, in train_net
[1,1]<stderr>:    self.update_lr(self.lr)
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 79, in update_lr
[1,1]<stderr>:    self.hvd.broadcast_value(lr_tensor, 'cv_learning_rate')
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
[1,1]<stderr>:    raise AttributeError("'{}' object has no attribute '{}'".format(
[1,1]<stderr>:AttributeError: 'CentralValueTrain' object has no attribute 'hvd'
[1,1]<stderr>:
[1,1]<stderr>:Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[1,0]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
[1,0]<stderr>:  LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
[1,0]<stderr>:/opt/conda/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
[1,0]<stderr>:  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
[1,0]<stderr>:/workspace/isaacgymenvs/isaacgymenvs/tasks/allegro_hand.py:275: DeprecationWarning: an integer is required (got type isaacgym._bindings.linux-x86_64.gym_38.DofDriveMode).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
[1,0]<stderr>:  asset_options.default_dof_drive_mode = gymapi.DOF_MODE_POS
[1,0]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/common/util.py:227: DeprecationWarning: Parameter `average` has been replaced with `op` and will be removed in v0.21.0
[1,0]<stderr>:  warnings.warn('Parameter `average` has been replaced with `op` and will be removed in v0.21.0',
[1,0]<stderr>:Error executing job with overrides: ['task=AllegroHandLSTM', 'headless=True', 'multi_gpu=True', 'train.params.config.mixed_precision=False']
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train.py", line 137, in launch_rlg_hydra
[1,0]<stderr>:    runner.run({
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 97, in run
[1,0]<stderr>:    self.run_train(args)
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 78, in run_train
[1,0]<stderr>:    agent.train()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1141, in train
[1,0]<stderr>:    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1012, in train_epoch
[1,0]<stderr>:    self.train_central_value()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 516, in train_central_value
[1,0]<stderr>:    return self.central_value_net.train_net()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 194, in train_net
[1,0]<stderr>:    self.update_lr(self.lr)
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 79, in update_lr
[1,0]<stderr>:    self.hvd.broadcast_value(lr_tensor, 'cv_learning_rate')
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
[1,0]<stderr>:    raise AttributeError("'{}' object has no attribute '{}'".format(
[1,0]<stderr>:AttributeError: 'CentralValueTrain' object has no attribute 'hvd'
[1,0]<stderr>:
[1,0]<stderr>:Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[1,0]<stdout>:[1,1]<stdout>:--------------------------------------------------------------------------

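It seems that the central value module never creates or receives the Horovod wrapper object, so the `self.hvd` attribute does not exist when `update_lr` calls `self.hvd.broadcast_value(...)`.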
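
To illustrate what I mean, here is a minimal sketch of one possible fix, assuming the agent that owns the central value network already holds a Horovod wrapper and can pass it in at construction time. `FakeHorovodWrapper`, `CentralValueSketch`, and the `hvd` constructor argument are hypothetical names used only for illustration; this is not the actual rl_games code, and the stand-in wrapper does not call Horovod at all.

```python
class FakeHorovodWrapper:
    """Hypothetical stand-in for the Horovod wrapper.

    It only records broadcast requests so the sketch runs without Horovod.
    """

    def __init__(self):
        self.broadcasts = []

    def broadcast_value(self, value, name):
        # A real wrapper would synchronize the value across ranks here.
        self.broadcasts.append((name, value))
        return value


class CentralValueSketch:
    """Hypothetical, heavily simplified central value trainer."""

    def __init__(self, lr, multi_gpu=False, hvd=None):
        # Fail early instead of hitting an AttributeError mid-training.
        if multi_gpu and hvd is None:
            raise ValueError("multi_gpu=True requires a Horovod wrapper")
        self.lr = lr
        self.multi_gpu = multi_gpu
        self.hvd = hvd  # always defined, even when it is None

    def update_lr(self, lr):
        # Only touch the wrapper when multi-GPU training is enabled.
        if self.multi_gpu:
            lr = self.hvd.broadcast_value(lr, 'cv_learning_rate')
        self.lr = lr
        return lr


# Usage: the owning agent passes its wrapper down to the central value net.
wrapper = FakeHorovodWrapper()
cv_net = CentralValueSketch(lr=3e-4, multi_gpu=True, hvd=wrapper)
cv_net.update_lr(1e-4)
```

Setting `self.hvd` unconditionally in the constructor means a missing wrapper is reported when the object is built rather than in the middle of the first training epoch.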