Cannot select specific CUDA device
Closed this issue · 3 comments
Search before asking
- I have searched the YOLOv5 issues and discussions and found no similar questions.
Question
Hello everyone,
Unfortunately, I cannot select a specific CUDA device.
I run the following:
python train.py --epochs 10 --device 0
I get:
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 174, in _check_capability
    capability = get_device_capability(d)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 448, in get_device_properties
    return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 848, in <module>
    main(opt)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 607, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/utils/torch_utils.py", line 134, in select_device
    p = torch.cuda.get_device_properties(i)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

CUDA call was originally invoked at:
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 34, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 238, in <module>
    _lazy_call(_check_capability)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))
But if I simply omit --device 0, then it works. However, all CUDA devices are selected.
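Since the traceback bottoms out inside torch.cuda's lazy initialization rather than in YOLOv5 code, the failure can presumably be reproduced without train.py at all. Here is a minimal sketch (untested, same virtualenv assumed) mirroring what select_device in utils/torch_utils.py does, namely exporting CUDA_VISIBLE_DEVICES after torch has already been imported:

import os
import torch  # imported first, as train.py does

# select_device() sets CUDA_VISIBLE_DEVICES after torch is already loaded;
# mirroring that here (assumption: this post-import change is the trigger)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# get_device_properties() forces torch's lazy CUDA init, which is where the
# DeferredCudaCallError above is raised; expected to fail the same way on
# torch 2.3.0 if the regression lives in the lazy-init path
print(torch.cuda.get_device_properties(0))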
I have installed the following pip packages:
Package Version
------------------------ --------------------
absl-py 2.1.0
albumentations 1.4.4
annotated-types 0.6.0
certifi 2024.2.2
charset-normalizer 3.3.2
contourpy 1.2.1
cycler 0.12.1
filelock 3.13.4
fonttools 4.51.0
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
grpcio 1.62.2
idna 3.7
imageio 2.34.1
Jinja2 3.1.3
joblib 1.4.0
kiwisolver 1.4.5
lazy_loader 0.4
Markdown 3.6
MarkupSafe 2.1.5
matplotlib 3.8.4
mpmath 1.3.0
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
opencv-python 4.9.0.80
opencv-python-headless 4.9.0.80
packaging 24.0
pandas 2.2.2
pillow 10.3.0
pip 22.0.2
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pydantic 2.7.1
pydantic_core 2.18.2
pyparsing 3.1.2
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
requests 2.31.0
scikit-image 0.23.2
scikit-learn 1.4.2
scipy 1.13.0
seaborn 0.13.2
setuptools 69.5.1
six 1.16.0
smmap 5.0.1
sympy 1.12
tensorboard 2.16.2
tensorboard-data-server 0.7.2
thop 0.1.1.post2209072238
threadpoolctl 3.4.0
tifffile 2024.4.24
torch 2.3.0
torchvision 0.18.0
tqdm 4.66.2
triton 2.3.0
typing_extensions 4.11.0
tzdata 2024.1
ultralytics 8.2.3
urllib3 2.2.1
Werkzeug 3.0.2
wheel 0.43.0
Python 3.10.12
Additional
No response
OK, it seems that something is wrong with PyTorch v2.3.0: the error does not occur with v2.2.2.
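For anyone else running into this, the downgrade is a one-liner (torchvision 0.17.2 should be the release that matches torch 2.2.2):

pip install torch==2.2.2 torchvision==0.17.2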
Hey there! 👋 Thanks for pinpointing that the issue seems tied to PyTorch v2.3. Different versions of PyTorch can have unique behaviors or bugs that affect downstream software, including YOLOv5.
For now, sticking with PyTorch v2.2.2, where you don't encounter this error, sounds like a solid workaround. It's always good practice to test different versions of dependencies when you run into issues. If anything else comes up or if you have further questions, feel free to ask! Your observations are a valuable contribution to the community.
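If you do need to stay on v2.3 for other reasons, one possible workaround (untested on your setup) is to restrict GPU visibility in the shell before Python starts, so that torch's lazy CUDA initialization only ever sees the device you want, and then omit the --device flag entirely:

CUDA_VISIBLE_DEVICES=0 python train.py --epochs 10

Here CUDA_VISIBLE_DEVICES is the standard CUDA runtime mechanism for hiding all other GPUs from the process, which sidesteps the post-import environment change inside select_device.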
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
- Docs: https://docs.ultralytics.com
- HUB: https://hub.ultralytics.com
- Community: https://community.ultralytics.com
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐