microsoft/ANCE

CUDA nccl library issue

francomarianardini opened this issue · 0 comments

Hello,

I cloned this repository because I am interested in running the run_inference.sh script. I followed the steps listed in the README. However, when I run run_inference.sh, I get the following error:

RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

My system has NCCL v2.7.8 correctly installed with the corresponding CUDA toolkit.
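
To make the version information concrete, here is a minimal sketch of how the versions seen by PyTorch itself could be reported; these may differ from the system-wide NCCL and CUDA installation, since PyTorch binaries can ship their own copies. The helper name `collect_versions` is hypothetical and not part of this repository.

```python
import importlib.util


def collect_versions():
    """Return version info as seen by PyTorch (None where unavailable)."""
    info = {"torch": None, "cuda_build": None, "nccl": None, "gpu_count": 0}

    # If PyTorch is not installed, return the empty report instead of failing.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch"] = torch.__version__
    # CUDA version PyTorch was compiled against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda

    if torch.cuda.is_available():
        info["gpu_count"] = torch.cuda.device_count()
        try:
            from torch.cuda import nccl

            # NCCL version PyTorch is using; the return format
            # (int or tuple) depends on the PyTorch release.
            info["nccl"] = nccl.version()
        except Exception:
            # NCCL may be unavailable in this build; leave the entry as None.
            pass

    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running run_inference.sh with the environment variables `CUDA_LAUNCH_BLOCKING=1` and `NCCL_DEBUG=INFO` set should also give a more precise trace for the device-side assert, since the NCCL error may only be a consequence of it.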

What am I missing here?

Thanks in advance for the help.

Best,

Franco Maria