JosephKJ/OWOD

Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU

JohnWuzh opened this issue · 1 comments

When I run "replicate.sh", training fails. The same problem occurs when I run "python tools/train_net.py --num-gpus 4 --dist-url='tcp://127.0.0.1:52133' --config-file ./configs/OWOD/t1/t1_val.yaml SOLVER.IMS_PER_BATCH 4 SOLVER.BASE_LR 0.01 OWOD.TEMPERATURE 1.5 OUTPUT_DIR "./output/t1_final"" directly. The error is:

image

Running "nvidia-smi" afterwards prints the following:
"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"

Looking forward to your response. Thank you very much!

Hi @ia-heng: this seems to be an NVIDIA driver issue. Please check with your system administrator. Thank you.
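
In case it helps when reporting this to an administrator, here is a minimal sketch of a check that could be run before launching training. It is not part of OWOD or of any NVIDIA tooling. The helper names `find_lost_gpu_lines` and `check_gpus` are hypothetical, and the only thing it relies on is the "GPU is lost" text quoted above appearing in the output of "nvidia-smi".

```python
import shutil
import subprocess
from typing import List

# Marker taken from the nvidia-smi message quoted in this issue.
LOST_MARKER = "GPU is lost"


def find_lost_gpu_lines(smi_output: str) -> List[str]:
    """Return the lines of nvidia-smi output that report a lost GPU."""
    return [
        line.strip()
        for line in smi_output.splitlines()
        if LOST_MARKER in line
    ]


def check_gpus() -> List[str]:
    """Run nvidia-smi and return any 'GPU is lost' lines (empty if none).

    Hypothetical helper: assumes nvidia-smi is installed and on PATH.
    """
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found on PATH")
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # The message may be written to stdout or stderr, so scan both.
    return find_lost_gpu_lines(result.stdout + "\n" + result.stderr)
```

Calling `check_gpus()` before "tools/train_net.py" and aborting when it returns a non-empty list would at least stop a multi-GPU run from starting on a machine that is already in this state. It only detects the condition; as the message itself says, recovering the GPU requires a reboot.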