isaac-sim/IsaacGymEnvs

Trying to run on cuda:1 crashes

Closed this issue · 4 comments

**I have 2 GPUs and I want to train only on the second one, so I ran:

python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1'

It crashes, saying something is still running on cuda:0. Any ideas on how to fix this?

Here is the full stack trace:**

(rlenv) bizon@dl:~/eric/IsaacGymEnvs-main/isaacgymenvs$ python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1'
Importing module 'gym_38' (/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/_bindings/linux-x86_64/gym_38.so)
Setting GYM_USD_PLUG_INFO_PATH to /home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/_bindings/linux-x86_64/usd/plugInfo.json
train.py:49: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_name="config", config_path="./cfg")
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing _self_. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
warnings.warn(msg, UserWarning)
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/hydra/_internal/defaults_list.py:415: UserWarning: In config: Invalid overriding of hydra/job_logging:
Default list overrides requires 'override' keyword.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/defaults_list_override for more information.

deprecation_warning(msg)
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
PyTorch version 1.13.1
Device count 2
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/_bindings/src/gymtorch
Using /home/bizon/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /home/bizon/.cache/torch_extensions/py38_cu117/gymtorch/build.ninja...
Building extension module gymtorch...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module gymtorch...
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/isaacgym/torch_utils.py:135: DeprecationWarning: np.float is a deprecated alias for the builtin float. To silence this warning, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def get_axis_params(value, axis_idx, x_value=0., dtype=np.float, n_dims=3):
2023-04-14 09:06:54,989 - INFO - logger - logger initialized
:3: DeprecationWarning: invalid escape sequence \*
Error: FBX library failed to load - importing FBX data will not succeed. Message: No module named 'fbx'
FBX tools must be installed from https://help.autodesk.com/view/FBX/2020/ENU/?guid=FBX_Developer_Help_scripting_with_python_fbx_installing_python_fbx_html
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if not hasattr(tensorboard, "version") or LooseVersion(
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:568: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
(np.object, string),
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:569: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
(np.bool, bool),
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:100: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
np.object: SlowAppendObjectArrayToTensorProto,
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/tensorboard/util/tensor_util.py:101: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
np.bool: SlowAppendBoolArrayToTensorProto,
task:
  name: Cartpole
  physics_engine: physx
  env:
    numEnvs: 512
    envSpacing: 4.0
    resetDist: 3.0
    maxEffort: 400.0
    clipObservations: 5.0
    clipActions: 1.0
    asset:
      assetRoot: ../../assets
      assetFileName: urdf/cartpole.urdf
    enableCameraSensors: False
  sim:
    dt: 0.0166
    substeps: 2
    up_axis: z
    use_gpu_pipeline: True
    gravity: [0.0, 0.0, -9.81]
    physx:
      num_threads: 4
      solver_type: 1
      use_gpu: True
      num_position_iterations: 4
      num_velocity_iterations: 0
      contact_offset: 0.02
      rest_offset: 0.001
      bounce_threshold_velocity: 0.2
      max_depenetration_velocity: 100.0
      default_buffer_size_multiplier: 2.0
      max_gpu_contact_pairs: 1048576
      num_subscenes: 4
      contact_collection: 0
  task:
    randomize: False
train:
  params:
    seed: 42
    algo:
      name: a2c_continuous
    model:
      name: continuous_a2c_logstd
    network:
      name: actor_critic
      separate: False
      space:
        continuous:
          mu_activation: None
          sigma_activation: None
          mu_init:
            name: default
          sigma_init:
            name: const_initializer
            val: 0
          fixed_sigma: True
      mlp:
        units: [32, 32]
        activation: elu
        initializer:
          name: default
        regularizer:
          name: None
    load_checkpoint: False
    load_path:
    config:
      name: Cartpole
      full_experiment_name: Cartpole
      env_name: rlgpu
      ppo: True
      mixed_precision: False
      normalize_input: True
      normalize_value: True
      num_actors: 512
      reward_shaper:
        scale_value: 0.1
      normalize_advantage: True
      gamma: 0.99
      tau: 0.95
      learning_rate: 0.0003
      lr_schedule: adaptive
      kl_threshold: 0.008
      score_to_win: 20000
      max_epochs: 100
      save_best_after: 50
      save_frequency: 25
      grad_norm: 1.0
      entropy_coef: 0.0
      truncate_grads: True
      e_clip: 0.2
      horizon_length: 16
      minibatch_size: 8192
      mini_epochs: 8
      critic_coef: 4
      clip_value: True
      seq_len: 4
      bounds_loss_coef: 0.0001
task_name: Cartpole
experiment:
num_envs:
seed: 42
torch_deterministic: False
max_iterations:
physics_engine: physx
pipeline: gpu
sim_device: cuda:1
rl_device: cuda:1
graphics_device_id: 0
num_threads: 4
solver_type: 1
num_subscenes: 4
test: False
checkpoint:
multi_gpu: False
wandb_activate: False
wandb_group:
wandb_name: Cartpole
wandb_entity:
wandb_project: isaacgymenvs
capture_video: False
capture_video_freq: 1464
capture_video_len: 100
force_render: True
headless: False
Setting seed: 42
self.seed = 42
Started to train
Exact experiment name requested from command line: Cartpole
/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
[Warning] [carb.gym.plugin] useGpu is set, forcing single scene (0 subscenes)
Not connected to PVD
+++ Using GPU PhysX
Physics Engine: PhysX
Physics Device: cuda:1
GPU Pipeline: enabled
Box(-1.0, 1.0, (1,), float32) Box(-inf, inf, (4,), float32)
current training device: cuda:0
build mlp: 4
RunningMeanStd: (1,)
RunningMeanStd: (4,)
Error executing job with overrides: ['task=Cartpole', 'rl_device=cuda:1', 'sim_device=cuda:1']
Traceback (most recent call last):
  File "train.py", line 161, in launch_rlg_hydra
    runner.run({
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/torch_runner.py", line 120, in run
    self.run_train(args)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/torch_runner.py", line 101, in run_train
    agent.train()
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1173, in train
    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1037, in train_epoch
    batch_dict = self.play_steps()
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 626, in play_steps
    res_dict = self.get_action_values(self.obs)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 348, in get_action_values
    res_dict = self.model(input_dict)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/algos_torch/models.py", line 246, in forward
    input_dict['obs'] = self.norm_obs(input_dict['obs'])
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/algos_torch/models.py", line 49, in norm_obs
    return self.running_mean_std(observation) if self.normalize_input else observation
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bizon/anaconda3/envs/rlenv/lib/python3.8/site-packages/rl_games/algos_torch/running_mean_std.py", line 79, in forward
    y = (input - current_mean.float()) / torch.sqrt(current_var.float() + self.epsilon)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

utomm commented

Hi, I encountered the same issue, and according to #109, it is because rl_device='cuda:1' doesn't work correctly.
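
That matches what the log shows: the config has sim_device: cuda:1, but rl_games reports "current training device: cuda:0", so the observation normalizer's buffers end up on cuda:0 while the observations come back from the simulation on cuda:1. Here is a minimal sketch of the same failure, assuming a machine with at least two CUDA devices (just an illustration, not IsaacGymEnvs or rl_games code):

```python
import torch

# Normalization buffers created on the training device (cuda:0), roughly what
# the RunningMeanStd module ends up with when training resolves to cuda:0.
current_mean = torch.zeros(4, device="cuda:0")
current_var = torch.ones(4, device="cuda:0")
epsilon = 1e-5

# Observations produced by the simulation on cuda:1.
obs = torch.randn(512, 4, device="cuda:1")

# Same expression as running_mean_std.py line 79 in the traceback. Mixing the
# two devices in one elementwise op raises:
# RuntimeError: Expected all tensors to be on the same device, but found at
# least two devices, cuda:1 and cuda:0!
y = (obs - current_mean.float()) / torch.sqrt(current_var.float() + epsilon)
```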

You can either follow their solution or simply add CUDA_VISIBLE_DEVICES=[gpu_ids] in front of your training command.
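
For example, something like this (assuming the second card is GPU index 1, and leaving rl_device and sim_device at their cuda:0 defaults, since only that one device will be visible):

CUDA_VISIBLE_DEVICES=1 python train.py task=Cartpole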

I just tried

CUDA_VISIBLE_DEVICES=1, python train.py task=Cartpole
CUDA_VISIBLE_DEVICES=1, python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1'
CUDA_VISIBLE_DEVICES=[1], python train.py task=Cartpole rl_device='cuda:1' sim_device='cuda:1'
CUDA_VISIBLE_DEVICES=[1], python train.py task=Cartpole

None of these work. It still crashes. I tried just using export as well. Were you able to get it to work?

@Robokan If you use CUDA_VISIBLE_DEVICES=1, you need to use cuda:0 instead of cuda:1, since only one GPU is now exposed.
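
So, assuming GPU index 1 is the card you want, something along these lines should keep everything on it:

CUDA_VISIBLE_DEVICES=1 python train.py task=Cartpole rl_device='cuda:0' sim_device='cuda:0'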

Great, that works. Thanks for the clarification.