RuntimeError: CUDA error: invalid device ordinal

Question

Opened this issue 10 months ago · 1 comments

python multitask_train.py ../ignite_dataset/ --task-list-file datasets/ignite_data.json --num_epochs 101 --save-dir ignite_models

2023-12-28 18:07:53: Starting train run FSMol_5e-05_256_4_1024_10_True_64_2_Multitask_2023-12-28_18-07-53.
2023-12-28 18:07:53:    Arguments: Namespace(DATA_PATH='../ignite_dataset/', task_list_file='datasets/ignite_data.json', save_dir='ignite_models', seed=0, azureml_logging=False, gnn_type='PNA', node_embed_dim=128, num_heads=4, per_head_dim=64, intermediate_dim=1024, message_function_depth=1, num_gnn_layers=10, readout_type='combined', readout_use_all_states=True, readout_num_heads=12, readout_head_dim=64, readout_output_dim=512, num_tail_layers=2, batch_size=256, num_epochs=101, patience=10, cuda=5, learning_rate=5e-05, metric_to_use='avg_precision', task_specific_lr=0.0001, finetune_lr_scale=1.0)
2023-12-28 18:07:53:    Output dir: ignite_models/FSMol_5e-05_256_4_1024_10_True_64_2_Multitask_2023-12-28_18-07-53
2023-12-28 18:07:53:    Data path: ../ignite_dataset/
2023-12-28 18:07:53: Identified 0 training tasks.
2023-12-28 18:07:53: Identified 0 validation tasks.
2023-12-28 18:07:53: Identified 0 test tasks.
/home/ubuntu/anaconda3/envs/IGNITE/lib/python3.9/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
Traceback (most recent call last):
  File "/home/ubuntu/BoltPro/December2023/GeneralVersionModels/NeurIPS/IGNITE/src/multitask_train.py", line 165, in <module>
    main()
  File "/home/ubuntu/BoltPro/December2023/GeneralVersionModels/NeurIPS/IGNITE/src/multitask_train.py", line 112, in main
    model = make_model_from_args(
  File "/home/ubuntu/BoltPro/December2023/GeneralVersionModels/NeurIPS/IGNITE/src/multitask_train.py", line 50, in make_model_from_args
    model = create_model(model_config, device=device)
  File "/home/ubuntu/BoltPro/December2023/GeneralVersionModels/NeurIPS/IGNITE/src/models/gnn_multitask.py", line 172, in create_model
    model = model.to(device)
  File "/home/ubuntu/BoltPro/December2023/GeneralVersionModels/NeurIPS/IGNITE/src/models/gnn_multitask.py", line 72, in to
    return super().to(device)
  File "/home/ubuntu/anaconda3/envs/IGNITE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/home/ubuntu/anaconda3/envs/IGNITE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/ubuntu/anaconda3/envs/IGNITE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/ubuntu/anaconda3/envs/IGNITE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/home/ubuntu/anaconda3/envs/IGNITE/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

> /home/ubuntu/anaconda3/envs/IGNITE/lib/python3.9/site-packages/torch/nn/modules/module.py(1158)convert()
-> return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
(Pdb)

Answer 1 · 2024-01-02T03:37:48.000Z

Two things:

It looks like either the DATA_PATH='../ignite_dataset/' does not contain any data or the task_list_file='datasets/ignite_data.json' is not correctly configured. This is implied by the log:

2023-12-28 18:07:53: Identified 0 training tasks.
2023-12-28 18:07:53: Identified 0 validation tasks.
2023-12-28 18:07:53: Identified 0 test tasks.
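One way to narrow down which of the two it is would be to count how many tasks named in the task list actually have a data file on disk. The sketch below is only an illustration: the function name count_matched_tasks is hypothetical, and it assumes the task list is a JSON object mapping fold names to lists of task names, with each task stored at DATA_PATH/<fold>/<task>.jsonl.gz. Adjust both assumptions to whatever layout the repository really uses.

```python
import json
from pathlib import Path


def count_matched_tasks(data_path, task_list_file, suffix=".jsonl.gz"):
    """Count, per fold, how many listed tasks have a data file on disk.

    ASSUMPTION (not confirmed by the issue): the task list is a JSON object
    of the form {"<fold>": ["<task name>", ...], ...} and each task is stored
    at <data_path>/<fold>/<task name><suffix>.
    """
    task_list = json.loads(Path(task_list_file).read_text())
    counts = {}
    for fold, task_names in task_list.items():
        fold_dir = Path(data_path) / fold
        # A task counts as matched only if its file exists under this fold.
        counts[fold] = sum(
            (fold_dir / f"{name}{suffix}").is_file() for name in task_names
        )
    return counts
```

For example, count_matched_tasks("../ignite_dataset/", "datasets/ignite_data.json") returning 0 for every fold would point at a path or naming mismatch, while an empty result would point at the JSON file itself.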

The error message is caused by the flags specifying --cuda 5 while your server has no device at that ordinal. CUDA device ordinals are zero-based, so ordinal 5 only exists on a machine with at least six visible GPUs. If you have a CUDA-enabled machine with at least one GPU, specifying --cuda 0 should fix the error.

See https://stackoverflow.com/questions/22175825/cuda-invalid-device-ordinal.
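To check this before launching a run, the requested ordinal can be compared against the number of devices PyTorch can see. This is a minimal sketch, not code from the repository: visible_cuda_devices and resolve_device are hypothetical helper names, and the only PyTorch call used is torch.cuda.device_count().

```python
def visible_cuda_devices() -> int:
    """Number of CUDA devices PyTorch can see (0 if PyTorch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def resolve_device(requested: int, device_count: int) -> str:
    """Map a requested ordinal to a device string, failing early if invalid.

    Hypothetical helper: falls back to "cpu" when no GPU is visible and
    raises a readable error instead of "invalid device ordinal".
    """
    if device_count == 0:
        return "cpu"
    if not 0 <= requested < device_count:
        raise ValueError(
            f"--cuda {requested} is out of range: {device_count} device(s) "
            f"visible, valid ordinals are 0 to {device_count - 1}"
        )
    return f"cuda:{requested}"


# Usage (hypothetical): device = resolve_device(5, visible_cuda_devices())
```

Note that setting the CUDA_VISIBLE_DEVICES environment variable restricts which GPUs a process can see and renumbers them starting from 0, so the valid range can be smaller than the number of GPUs physically installed.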