mattshma/bigdata

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

mattshma opened this issue · 4 comments

The dmesg log shows the following:

[188497.595099] NVRM: No NVIDIA graphics adapter probed!
[188497.595838] nvidia-nvlink: Unregistered the Nvlink Core, major device number 239
[188549.975172] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[188549.976351] NVRM: The NVIDIA probe routine was not called for 3 device(s).
[188549.977053] NVRM: This can occur when a driver such as: 
NVRM: nouveau, rivafb, nvidiafb or rivatv 
NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[188549.978961] NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.

A reboot fixed it.
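Before resorting to a reboot, it can help to check whether one of the modules named in the NVRM message is actually loaded. A minimal sketch, assuming lsmod-style text on stdin; find_conflicting_modules is a made-up helper name, not part of any NVIDIA tooling:

```shell
# find_conflicting_modules: hypothetical helper (name chosen for illustration).
# Reads lsmod-style output on stdin and prints the name of every loaded
# module listed in the NVRM message: nouveau, rivafb, nvidiafb, rivatv.
find_conflicting_modules() {
    awk '$1 == "nouveau" || $1 == "rivafb" || $1 == "nvidiafb" || $1 == "rivatv" { print $1 }'
}

# Typical use on a live system:
#   lsmod | find_conflicting_modules
```

Empty output means none of the four modules is loaded, in which case the cause is probably elsewhere.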

#108 merge

Ran into this problem again, and this time it still occurs after a reboot:

# lsmod |grep nvidia
# yum list cuda-drivers
Installed Packages
cuda-drivers.x86_64                                                             384.81-1                                                              @cuda-9-0-local

The driver package is installed, but the kernel module is not loaded. The running kernel is 4.14.67-2dev917.el7.x86_64.

So I started troubleshooting:

  1. Confirm that this is actually a GPU machine:
# lspci | grep -i nvidia
0000:5a:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
0000:5e:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
0000:62:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
0000:66:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
0000:b5:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
0000:c1:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
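  2. The cuda driver is already installed (see the yum list cuda-drivers output above).
  3. Check whether the dependencies are installed: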
# yum list kernel-devel libvdpau dkms
Installed Packages
dkms.noarch                                                                    2.6.1-1.el7                                                                    @epel  
libvdpau.x86_64                                                                1.1.1-3.el7                                                                    @base  
Available Packages
kernel-devel.x86_64                                                            3.10.0-957.1.3.el7                                                             updates
libvdpau.i686                                                                  1.1.1-3.el7                                                                    base   

kernel-devel is only listed as available, not installed, so install it.
  4. Check the status of the nvidia kernel module:

# dkms status
nvidia, 384.81: added

The status is only added, so the module still needs to be built and installed. Continue with:

# dkms build -m nvidia -v 384.81

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j8 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.14.67-2dev917.el7.x86_64 modules.....(bad exit status: 2)
Error! Bad return status for module build on kernel: 4.14.67-2dev917.el7.x86_64 (x86_64)
Consult /var/lib/dkms/nvidia/384.81/build/make.log for more information.
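The check done by eye above (read dkms status, then build whatever is not yet installed) can be sketched as a small filter. This is only an illustration: dkms_pending is a hypothetical helper, and it assumes the "name, version[, kernel, arch]: state" layout shown in this thread, which other dkms versions may not use:

```shell
# dkms_pending: hypothetical helper (name chosen for illustration).
# Reads `dkms status` output on stdin and prints "<module> <version> <state>"
# for every entry whose state is not exactly "installed".
# Assumes lines shaped like "nvidia, 384.81: added".
dkms_pending() {
    awk -F': ' '
        NF >= 2 {
            state = $NF
            if (state == "installed") next
            split($1, f, ", ")
            print f[1], f[2], state
        }'
}

# Typical use on a live system:
#   dkms status | dkms_pending
```

Each line it prints is a candidate for dkms build followed by dkms install with the same -m and -v values.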

The contents of /var/lib/dkms/nvidia/384.81/build/make.log:

 CONFTEST: is_export_symbol_gpl_refcount_dec_and_test
  SYMLINK /usr/src/nvidia-384.81/nvidia/nv-kernel.o
  LD [M]  /usr/src/nvidia-384.81/nvidia.o
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-frontend.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-instance.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-gpu-numa.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-acpi.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-chrdev.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-cray.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-dma.o: No such file or directory
ld: cannot find /usr/src/nvidia-384.81/nvidia/nv-gvi.o: No such file or directory

Searching for related issues turned up two suggested fixes, "Can't build 375.26 on Linux 4.9" and "nvidia dkms fails after update from 4.15.9 to 4/15/16", so I tried:

# yum remove elfutils-libelf-devel
# yum install cuda dkms
# dkms build -m nvidia -v 384.81                           

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j8 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.14.67-2dev917.el7.x86_64 modules.....(bad exit status: 2)
Error! Bad return status for module build on kernel: 4.14.67-2dev917.el7.x86_64 (x86_64)
Consult /var/lib/dkms/nvidia/384.81/build/make.log for more information.

Checking /var/lib/dkms/nvidia/384.81/build/make.log again, the error has changed to:

/var/lib/dkms/nvidia/384.81/build/nvidia-uvm/uvm8_va_block.c: In function ‘block_cpu_fault_locked’:
/var/lib/dkms/nvidia/384.81/build/nvidia-uvm/uvm8_va_block.c:8771:41: error: implicit declaration of function ‘task_stack_page’ [-Werror=implicit-function-declaration]
                                         KSTK_EIP(current));
                                         ^
cc1: some warnings being treated as errors
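Since make.log is long, it may be easier to pull out only the lines that look like the two kinds of failure seen so far (linker "cannot find" messages and compiler errors). A rough sketch; make_log_errors is a made-up name and the patterns are taken only from the excerpts in this thread, so other failure types would be missed:

```shell
# make_log_errors: hypothetical helper (name chosen for illustration).
# Reads a build log on stdin and prints, with line numbers, the lines that
# contain a compiler "error:" or a linker "ld: cannot find" message.
# `|| true` keeps the exit status 0 when nothing matches.
make_log_errors() {
    grep -n -E 'error:|ld: cannot find' || true
}

# Typical use on a live system:
#   make_log_errors < /var/lib/dkms/nvidia/384.81/build/make.log
```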

This felt like a dead end, so I went back to the starting point.

Tried reinstalling with yum install cuda, which failed as follows:

Building module:
cleaning build area...
'make' -j8 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.10.13-1.el7.elrepo.x86_64 modules...(bad exit status: 2)
Error! Bad return status for module build on kernel: 4.10.13-1.el7.elrepo.x86_64 (x86_64)
Consult /var/lib/dkms/nvidia/384.81/build/make.log for more information.

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j8 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.10.13-1.el7.elrepo.x86_64 modules...(bad exit status: 2)
Error! Bad return status for module build on kernel: 4.10.13-1.el7.elrepo.x86_64 (x86_64)
Consult /var/lib/dkms/nvidia/384.81/build/make.log for more information.
warning: %post(nvidia-kmod-1:384.81-2.el7.x86_64) scriptlet failed, exit status 10
Non-fatal POSTIN scriptlet failure in rpm package 1:nvidia-kmod-384.81-2.el7.x86_64
  Installing : 1:xorg-x11-drv-nvidia-384.81-1.el7.x86_64

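This looks like the same error as building through dkms. We had previously hit incompatibilities between kernel and driver versions, which were resolved by upgrading the GPU driver or downgrading the kernel, so I checked for a kernel version problem again, without success. My next guess was that other nvidia-related packages had not been removed cleanly, so I removed the installed cuda along with nvidia-kmod, xorg-x11-drv-nvidia, nvidia-modprobe, nvidia-driver-cuda-libs, nvidia-driver-NVML, and so on: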

$ rpm -qa |grep nvidia
$ sudo yum remove nvidia-modprobe nvidia-driver-cuda-libs nvidia-driver-NVML nvidia-kmod xorg-x11-drv-nvidia nvidia-libXNVCtrl-devel nvidia-xconfig nvidia-diag-driver-local-repo-rhel7-410.79-1.0-1
$ rpm -qa |grep nvidia
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Reboot the machine.

After the reboot, run:

$ lspci | grep -i nvidia
5a:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
5e:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
62:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
66:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
b5:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
c1:00.0 3D controller: NVIDIA Corporation Device 1db6 (rev a1)
$ nvidia-smi
No devices were found

Check the dmesg output:

[   71.642638] NVRM: rm_init_adapter failed for device bearing minor number 0
[   72.464409] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[   72.464712] NVRM: rm_init_adapter failed for device bearing minor number 1
[   73.285400] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[   73.285716] NVRM: rm_init_adapter failed for device bearing minor number 2
[   74.110671] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[   74.110985] NVRM: rm_init_adapter failed for device bearing minor number 3
[   74.900252] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[   74.900534] NVRM: rm_init_adapter failed for device bearing minor number 4
[   75.704240] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[   75.704541] NVRM: rm_init_adapter failed for device bearing minor number 5
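Each GPU that fails to initialize produces one rm_init_adapter line, so the affected device minors can be listed straight from dmesg. A minimal sketch, assuming the message wording shown above; failed_gpu_minors is a hypothetical helper name:

```shell
# failed_gpu_minors: hypothetical helper (name chosen for illustration).
# Reads dmesg-style text on stdin and prints the unique device minor numbers
# that appear in "rm_init_adapter failed" messages, sorted numerically.
failed_gpu_minors() {
    sed -n 's/.*NVRM: rm_init_adapter failed for device bearing minor number \([0-9][0-9]*\).*/\1/p' \
        | sort -n -u
}

# Typical use on a live system:
#   dmesg | failed_gpu_minors
```

In the excerpt above the failing minors are 0 through 5, which matches the six devices that lspci lists, so every GPU in the machine is affected rather than a single card.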

Ran nvidia-bug-report.sh and inspected the generated log; it reports that nvidia-modeset.o cannot be found.

Remove cuda:

$ rpm -qa |grep cuda |grep -v nccl |xargs sudo yum -y remove
$ rm -rf /usr/lib/modules/4.10.13-1.el7.elrepo.x86_64
$ rm -rf /var/lib/dkms/nvidia/384.81/4.10.13-1.el7.elrepo.x86_64

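Continuing from the above.

Because the rpm-installed cuda on this machine kept reporting missing modules, I downloaded the run file from the official site and reinstalled. After that the missing-module problem was gone, but nvidia-smi still fails with: No devices were found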

/var/log/nvidia-installer.log shows nothing abnormal.
dmesg still shows the same errors:

[ 5423.051976] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 5423.052184] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 5423.873076] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 5423.873312] NVRM: rm_init_adapter failed for device bearing minor number 1
[ 5424.701605] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 5424.701837] NVRM: rm_init_adapter failed for device bearing minor number 2
[ 5425.523365] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 5425.523590] NVRM: rm_init_adapter failed for device bearing minor number 3
[ 5426.312144] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 5426.312362] NVRM: rm_init_adapter failed for device bearing minor number 4
[ 5427.105660] NVRM: RmInitAdapter failed! (0x26:0xffff:1102)
[ 5427.105910] NVRM: rm_init_adapter failed for device bearing minor number 5

Run:

dkms status
nvidia, 384.81, 4.10.13-1.el7.elrepo.x86_64, x86_64: installed

Run:

sudo ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

Ran nvidia-bug-report.sh; the generated log contains the following:

ERROR: GetCaptureBuffer failed, Not Supported, bufSize: 0x20
ERROR: internal_getDumpBuffer failed, return code: 0x3
ERROR: internal_dumpSystemComponent() failed, return code: 0x3

Check the GPU information:

cat /proc/driver/nvidia/gpus/0000\:5a\:00.0/information
Model: 		 Graphics Device
IRQ:   		 213
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:5a:00.0
Device Minor: 	 0

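The driver version previously installed on this machine, 384.145, did work.

With no other option, I installed driver version 384.145 again and then installed cuda 9. The images depend on driver version 384.81, but after testing the containers on this machine, the services still run. After struggling with this for days, the fix turned out to be upgrading the driver version after all.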

On another occasion, the build failed with:

# dkms build -m nvidia -v 418.67
Error! echo
Your kernel headers for kernel 3.10.0-693.21.1.el7.x86_64 cannot be found at
/lib/modules/3.10.0-693.21.1.el7.x86_64/build or /lib/modules/3.10.0-693.21.1.el7.x86_64/source.

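kernel-headers also needs to be installed. Note that on CentOS/RHEL the /lib/modules/<kernel version>/build path is normally a symlink into /usr/src/kernels, which is populated by the kernel-devel package for that exact kernel version, so kernel-devel has to match the running kernel as well.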
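The dkms error above boils down to two directories not existing, so the same check can be run up front before calling dkms build. A minimal sketch; kernel_build_dir is a hypothetical helper, and the modules root is passed as an argument only so the function can be tried against a scratch directory:

```shell
# kernel_build_dir MODULES_ROOT KERNEL_RELEASE
# Hypothetical helper (name chosen for illustration). Prints the first of
# MODULES_ROOT/KERNEL_RELEASE/build or .../source that exists as a directory
# and returns 0; returns 1 if neither exists. A dangling symlink counts as
# missing, because `[ -d ]` follows symlinks.
kernel_build_dir() {
    root=$1
    release=$2
    for sub in build source; do
        if [ -d "$root/$release/$sub" ]; then
            printf '%s\n' "$root/$release/$sub"
            return 0
        fi
    done
    return 1
}

# Typical use on a live system:
#   kernel_build_dir /lib/modules "$(uname -r)" || echo "kernel build tree missing"
```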