Cannot select specific CUDA device
Closed this issue · 3 comments
Search before asking
- I have searched the YOLOv5 issues and discussions and found no similar questions.
Question
Hello everyone,
Unfortunately, I cannot select a specific CUDA device.
I run the following:
python train.py --epochs 10 --device 0
I get:
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=10, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 306, in _lazy_init
    queued_call()
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 174, in _check_capability
    capability = get_device_capability(d)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 448, in get_device_properties
    return _get_device_properties(device) # type: ignore[name-defined]
RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 848, in <module>
    main(opt)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 607, in main
    device = select_device(opt.device, batch_size=opt.batch_size)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/utils/torch_utils.py", line 134, in select_device
    p = torch.cuda.get_device_properties(i)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init() # will define _get_device_properties
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at "../aten/src/ATen/cuda/CUDAContext.cpp":50, please report a bug to PyTorch. device=, num_gpus=

CUDA call was originally invoked at:
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/train.py", line 34, in <module>
    import torch
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/__init__.py", line 1478, in <module>
    _C._initExtension(manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 238, in <module>
    _lazy_call(_check_capability)
  File "/raid/USERDATA/pawlodwp/yolo_detectors/yolov5/YOLOv5_test_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 235, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))
But if I simply omit --device 0, then it works. However, all CUDA devices are selected.
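Since the traceback bottoms out inside torch.cuda's lazy initialization rather than in YOLOv5 code, the failure can presumably be reproduced without train.py at all. Here is a minimal sketch (untested, same virtualenv assumed) mirroring what select_device in utils/torch_utils.py does, namely exporting CUDA_VISIBLE_DEVICES after torch has already been imported:

import os
import torch  # imported first, as train.py does

# select_device() sets CUDA_VISIBLE_DEVICES after torch is already loaded;
# mirroring that here (assumption: this post-import change is the trigger)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# get_device_properties() forces torch's lazy CUDA init, which is where the
# DeferredCudaCallError above is raised; expected to fail the same way on
# torch 2.3.0 if the regression lives in the lazy-init path
print(torch.cuda.get_device_properties(0))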
I have installed the following pip packages:
Package Version
------------------------ --------------------
absl-py 2.1.0
albumentations 1.4.4
annotated-types 0.6.0
certifi 2024.2.2
charset-normalizer 3.3.2
contourpy 1.2.1
cycler 0.12.1
filelock 3.13.4
fonttools 4.51.0
fsspec 2024.3.1
gitdb 4.0.11
GitPython 3.1.43
grpcio 1.62.2
idna 3.7
imageio 2.34.1
Jinja2 3.1.3
joblib 1.4.0
kiwisolver 1.4.5
lazy_loader 0.4
Markdown 3.6
MarkupSafe 2.1.5
matplotlib 3.8.4
mpmath 1.3.0
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105
opencv-python 4.9.0.80
opencv-python-headless 4.9.0.80
packaging 24.0
pandas 2.2.2
pillow 10.3.0
pip 22.0.2
protobuf 5.26.1
psutil 5.9.8
py-cpuinfo 9.0.0
pydantic 2.7.1
pydantic_core 2.18.2
pyparsing 3.1.2
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
requests 2.31.0
scikit-image 0.23.2
scikit-learn 1.4.2
scipy 1.13.0
seaborn 0.13.2
setuptools 69.5.1
six 1.16.0
smmap 5.0.1
sympy 1.12
tensorboard 2.16.2
tensorboard-data-server 0.7.2
thop 0.1.1.post2209072238
threadpoolctl 3.4.0
tifffile 2024.4.24
torch 2.3.0
torchvision 0.18.0
tqdm 4.66.2
triton 2.3.0
typing_extensions 4.11.0
tzdata 2024.1
ultralytics 8.2.3
urllib3 2.2.1
Werkzeug 3.0.2
wheel 0.43.0
Python 3.10.12
Additional
No response
OK, it seems that something is wrong with PyTorch v2.3.0: the error does not occur with v2.2.2.
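For anyone else running into this, the downgrade is a one-liner (torchvision 0.17.2 should be the release that matches torch 2.2.2):

pip install torch==2.2.2 torchvision==0.17.2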
Hey there! 👋 Thanks for pinpointing that the issue seems tied to PyTorch v2.3. Different versions of PyTorch can have unique behaviors or bugs that affect downstream software, including YOLOv5.
For now, sticking with PyTorch v2.2.2, where you don't encounter this error, sounds like a solid workaround. It's always good practice to test different versions of dependencies when you run into issues. If anything else comes up or if you have further questions, feel free to ask! Your observations are a valuable contribution to the community.
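If you do need to stay on v2.3 for other reasons, one possible workaround (untested on your setup) is to restrict GPU visibility in the shell before Python starts, so that torch's lazy CUDA initialization only ever sees the device you want, and then omit the --device flag entirely:

CUDA_VISIBLE_DEVICES=0 python train.py --epochs 10

Here CUDA_VISIBLE_DEVICES is the standard CUDA runtime mechanism for hiding all other GPUs from the process, which sidesteps the post-import environment change inside select_device.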
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
- Docs: https://docs.ultralytics.com
- HUB: https://hub.ultralytics.com
- Community: https://community.ultralytics.com
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐