AliyunContainerService/gpushare-device-plugin

GPU device not detected with nvidia driver > 430.XX

ptonelli opened this issue · 1 comments

When running with 450.XX or 460.XX drivers, the logs of the pod are:

gpumanager.go:28] Loading NVML
gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to `nvidia`?

The NVIDIA driver is running correctly on the machine, as nvidia-smi shows the GPU.
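Since the log reports that the NVML library could not be loaded, one thing worth checking is whether `libnvidia-ml.so` is visible from inside the plugin container at all. The following is a minimal diagnostic sketch, not part of the plugin. The directory list and the `findNVML` helper are assumptions for illustration, and the actual library location depends on the distribution and on how the container runtime mounts the driver files.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// candidateDirs lists directories where the NVML shared library is often
// installed. This list is an assumption for illustration; adjust it for
// your distribution and container image.
var candidateDirs = []string{
	"/usr/lib/x86_64-linux-gnu",
	"/usr/lib64",
	"/usr/lib",
	"/usr/local/nvidia/lib64",
}

// findNVML returns every file matching libnvidia-ml.so* in the given
// directories. Directories that do not exist simply yield no matches.
func findNVML(dirs []string) []string {
	var found []string
	for _, dir := range dirs {
		matches, err := filepath.Glob(filepath.Join(dir, "libnvidia-ml.so*"))
		if err != nil {
			continue // Glob only errors on a malformed pattern
		}
		found = append(found, matches...)
	}
	return found
}

func main() {
	libs := findNVML(candidateDirs)
	if len(libs) == 0 {
		fmt.Println("no libnvidia-ml.so* found in the searched directories")
		return
	}
	for _, lib := range libs {
		fmt.Println("found:", lib)
	}
}
```

If this is run inside the device plugin container and finds nothing, the library is probably not being injected by the runtime, which would be consistent with the hint in the log about the docker default runtime.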

We are currently trying to update the dependencies of the project and rebuild the device plugin, but have so far failed to solve the issue.

Lowering the Linux kernel image version from 5.10 to 4.18 solved the issue.