GOATmessi8/ASFF

Problems with training


When I train this project, I run into the following problem: training gets stuck after "using cuda" is printed, and nothing changes afterwards.
All packages listed in the code's requirements are installed. My PyTorch version is 1.3.1 (Python 3.6), CUDA is 10.1, the OS is Ubuntu 16.04, and the NVIDIA driver is 418.87.01. The machine has only four GPUs, so the command I run is:

python -m torch.distributed.launch --nproc_per_node=10 --master_port=${RANDOM+10000} main.py --cfg config/yolov3_baseline.cfg -d COCO --distributed --ngpu 4 --checkpoint weights/YOLOv3-baseline_38.8.pth --start_epoch 0 --half -s 608

Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=2, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=6, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=4, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=5, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=7, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=1, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=0, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=8, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=9, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='weights/YOLOv3-baseline_38.8.pth', dataset='COCO', debug=False, distributed=True, dropblock=False, eval_interval=10, half=True, local_rank=3, log_dir='log/', n_cpu=4, ngpu=4, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=608, testset=False, tfboard=False, use_cuda=True, vis=False)

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp line=37 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "main.py", line 470, in <module>
    main()
  File "main.py", line 98, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/xxx/anaconda3/envs/asff/lib/python3.6/site-packages/torch/cuda/__init__.py", line 300, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1573049304260/work/torch/csrc/cuda/Module.cpp:37

successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
Done (t=17.82s)
creating index...
Done (t=17.83s)
creating index...
index created!
Training YOLOv3 strong baseline!
index created!
Training YOLOv3 strong baseline!
Done (t=19.36s)
creating index...
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMe, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 5, 'IMGSIZE': 608, 'IGNORETH65, 'IMGSIZE': 608}}
loading annotations into memory...
Done (t=17.86s)
creating index...
index created!
Training YOLOv3 strong baseline!
loading pytorch ckpt... weights/YOLOv3-baseline_38.8.pth
using cuda

I don't know exactly where the mistake is. I hope you can give me some advice. Thank you.
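
One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: `torch.distributed.launch` starts `--nproc_per_node` processes and gives each a `local_rank` counting up from 0, and `main.py` passes that rank to `torch.cuda.set_device`, which only accepts ordinals below the number of visible GPUs. The sketch below uses a hypothetical helper, `invalid_local_ranks` (not part of ASFF or PyTorch), to list the ranks that would have no matching device for the values in this report.

```python
def invalid_local_ranks(nproc_per_node, visible_gpus):
    """Return the local ranks that have no matching CUDA device ordinal.

    Hypothetical helper for illustration only. Valid ordinals are
    0 .. visible_gpus - 1, so any rank at or above visible_gpus would
    fail in torch.cuda.set_device with "invalid device ordinal".
    """
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


if __name__ == "__main__":
    # Values taken from this report: --nproc_per_node=10 on a four-GPU machine.
    print(invalid_local_ranks(nproc_per_node=10, visible_gpus=4))
    # Launching one process per GPU leaves no rank without a device.
    print(invalid_local_ranks(nproc_per_node=4, visible_gpus=4))
```

On a real machine the second argument would come from `torch.cuda.device_count()`. If this is the cause, the processes that did reach "using cuda" are presumably waiting for the ranks that crashed, which could explain why nothing happens afterwards.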