"Failed to initialize NVML: Unknown Error" after random amount of time
iFede94 opened this issue · 79 comments
1. Issue or feature description
After a random amount of time (hours or days), the GPUs become unavailable inside all running containers, and nvidia-smi returns "Failed to initialize NVML: Unknown Error".
Restarting all the containers fixes the issue and the GPUs become available again.
Outside the containers, the GPUs keep working correctly.
I tried searching in the open/closed issues but I could not find any solution.
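The symptom described above can be checked from the host without attaching to each container. The following is a minimal sketch, not part of the original report: `is_nvml_failure` and `check_container` are hypothetical helper names, and the only thing taken from the report is the error string itself. It assumes the containers are reachable through `docker exec`.

```shell
# Sketch: detect the NVML failure described above in a running container.
# Function names are illustrative; only the error string comes from the report.

# Succeeds (exit 0) if the text on stdin contains the NVML initialization error.
is_nvml_failure() {
    grep -q 'Failed to initialize NVML'
}

# Runs nvidia-smi inside container $1 and prints one of:
#   "lost <name>"   nvidia-smi printed the NVML error
#   "error <name>"  the command failed for some other reason
#   "ok <name>"     nvidia-smi ran successfully
check_container() {
    nvml_out=$(docker exec "$1" nvidia-smi 2>&1) && nvml_status=0 || nvml_status=$?
    if printf '%s\n' "$nvml_out" | is_nvml_failure; then
        echo "lost $1"
    elif [ "$nvml_status" -ne 0 ]; then
        echo "error $1"
    else
        echo "ok $1"
    fi
}
```

Calling `check_container <container-name>` periodically (for example from cron) would show roughly when a container loses its GPUs, which may help correlate the failure with other events on the host.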
2. Steps to reproduce the issue
All the containers are run with docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
3. Information to attach
- Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0831 10:36:45.129762 2174149 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0831 10:36:45.129878 2174149 nvc.c:350] using root /
I0831 10:36:45.129892 2174149 nvc.c:351] using ldcache /etc/ld.so.cache
I0831 10:36:45.129906 2174149 nvc.c:352] using unprivileged user 1000:1000
I0831 10:36:45.129960 2174149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0831 10:36:45.130411 2174149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0831 10:36:45.132458 2174150 nvc.c:273] failed to set inheritable capabilities
W0831 10:36:45.132555 2174150 nvc.c:274] skipping kernel modules load due to failure
I0831 10:36:45.133242 2174151 rpc.c:71] starting driver rpc service
I0831 10:36:45.141625 2174152 rpc.c:71] starting nvcgo rpc service
I0831 10:36:45.144941 2174149 nvc_info.c:766] requesting driver information with ''
I0831 10:36:45.146226 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.515.48.07
I0831 10:36:45.146379 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.146563 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.48.07
I0831 10:36:45.146792 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.146986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.147178 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.147375 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.48.07
I0831 10:36:45.147400 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.147598 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.147777 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.147986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.148258 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.148506 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.148699 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.148915 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.148942 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07
I0831 10:36:45.149219 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07
I0831 10:36:45.149467 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.149591 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.149814 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.149996 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.150224 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.150437 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.48.07
I0831 10:36:45.150772 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.150978 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.151147 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.151335 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.151592 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.151786 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.151970 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.152225 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.152480 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.152791 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.152999 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.153254 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.153580 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.153853 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.154063 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.154259 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.154473 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.154696 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.515.48.07
W0831 10:36:45.154723 2174149 nvc_info.c:399] missing library libnvidia-nscq.so
W0831 10:36:45.154726 2174149 nvc_info.c:399] missing library libcudadebugger.so
W0831 10:36:45.154729 2174149 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0831 10:36:45.154731 2174149 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0831 10:36:45.154733 2174149 nvc_info.c:399] missing library libvdpau_nvidia.so
W0831 10:36:45.154735 2174149 nvc_info.c:399] missing library libnvidia-ifr.so
W0831 10:36:45.154737 2174149 nvc_info.c:399] missing library libnvidia-cbl.so
W0831 10:36:45.154739 2174149 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0831 10:36:45.154741 2174149 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0831 10:36:45.154743 2174149 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0831 10:36:45.154746 2174149 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0831 10:36:45.154748 2174149 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0831 10:36:45.154750 2174149 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0831 10:36:45.154752 2174149 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0831 10:36:45.154754 2174149 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0831 10:36:45.154756 2174149 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0831 10:36:45.154758 2174149 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0831 10:36:45.154760 2174149 nvc_info.c:403] missing compat32 library libnvoptix.so
W0831 10:36:45.154762 2174149 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0831 10:36:45.154919 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0831 10:36:45.154945 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0831 10:36:45.154954 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0831 10:36:45.154970 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0831 10:36:45.154980 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0831 10:36:45.155027 2174149 nvc_info.c:425] missing binary nv-fabricmanager
I0831 10:36:45.155044 2174149 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.48.07/gsp.bin
I0831 10:36:45.155058 2174149 nvc_info.c:529] listing device /dev/nvidiactl
I0831 10:36:45.155061 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm
I0831 10:36:45.155063 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0831 10:36:45.155065 2174149 nvc_info.c:529] listing device /dev/nvidia-modeset
I0831 10:36:45.155080 2174149 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0831 10:36:45.155092 2174149 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0831 10:36:45.155100 2174149 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0831 10:36:45.155102 2174149 nvc_info.c:822] requesting device information with ''
I0831 10:36:45.161039 2174149 nvc_info.c:713] listing device /dev/nvidia0 (GPU-13fd0930-06c3-5975-8720-72c72ee7a823 at 00000000:01:00.0)
I0831 10:36:45.166471 2174149 nvc_info.c:713] listing device /dev/nvidia1 (GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 at 00000000:02:00.0)
NVRM version: 515.48.07
CUDA version: 11.7
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-13fd0930-06c3-5975-8720-72c72ee7a823
Bus Location: 00000000:01:00.0
Architecture: 7.5
Device Index: 1
Device Minor: 1
Model: NVIDIA GeForce RTX 2080 Ti
Brand: GeForce
GPU UUID: GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
Bus Location: 00000000:02:00.0
Architecture: 7.5
I0831 10:36:45.166493 2174149 nvc.c:434] shutting down library context
I0831 10:36:45.166540 2174152 rpc.c:95] terminating nvcgo rpc service
I0831 10:36:45.166751 2174149 rpc.c:135] nvcgo rpc service terminated successfully
I0831 10:36:45.167790 2174151 rpc.c:95] terminating driver rpc service
I0831 10:36:45.167907 2174149 rpc.c:135] driver rpc service terminated successfully
- Kernel version from
uname -a
Linux wds-co-ml 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Driver information from
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Wed Aug 31 12:42:55 2022
Driver Version : 515.48.07
CUDA Version : 11.7
Attached GPUs : 2
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 2080 Ti
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-13fd0930-06c3-5975-8720-72c72ee7a823
Minor Number : 0
VBIOS Version : 90.02.0B.00.C7
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x150319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11264 MiB
Reserved : 244 MiB
Used : 1 MiB
Free : 11018 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 3 MiB
Free : 253 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 30 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 20.87 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
GPU 00000000:02:00.0
Product Name : NVIDIA GeForce RTX 2080 Ti
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
Minor Number : 1
VBIOS Version : 90.02.17.00.58
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x150319DA
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 35 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11264 MiB
Reserved : 244 MiB
Used : 1 MiB
Free : 11018 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 27 MiB
Free : 229 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 6.66 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
- Docker version from
docker version
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.6
GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc:
Version: 1.1.2
GitCommit: v1.1.2-0-ga916309
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- NVIDIA packages version from
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
ii libnvidia-cfg1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-515 515.48.07-0ubuntu0.22.04.2 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA libcompute package
ii libnvidia-compute-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA libcompute package
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA Video Decoding runtime libraries
ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-encode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVENC Video Encoding runtime library
ii libnvidia-extra-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 Extra libraries for the NVIDIA driver
ii libnvidia-fbc1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii linux-modules-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43
ii linux-modules-nvidia-515-generic-hwe-22.04 5.15.0-43.46 amd64 Extra drivers for nvidia-515 for the generic-hwe-22.04 flavour
ii linux-objects-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43 (objects)
ii linux-signatures-nvidia-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel signatures for nvidia modules for version 5.15.0-43-generic
ii nvidia-compute-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA compute utilities
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver metapackage
ii nvidia-kernel-common-515 515.48.07-0ubuntu0.22.04.2 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime
ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nvidia-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary Xorg driver
- NVIDIA container library version from
nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- Docker command, image and tag used
docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
The nvidia-smi output shows persistence mode as disabled. Does the behaviour still occur when it is enabled?
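To check the persistence mode question quickly, the setting can be read out of `nvidia-smi -a` output. This is only a sketch: `count_persistence_disabled` is a hypothetical helper that counts the `Persistence Mode : Disabled` lines in whatever text it receives on stdin, tolerating the variable spacing nvidia-smi uses.

```shell
# Sketch: count GPUs that report persistence mode as disabled.
# Reads `nvidia-smi -a` style text on stdin; the function name is illustrative.
count_persistence_disabled() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so that status is deliberately ignored.
    grep -c 'Persistence Mode *: *Disabled' || true
}
```

On the host, `nvidia-smi -a | count_persistence_disabled` prints the number of GPUs with persistence mode off. For a test, persistence mode can be switched on with `nvidia-smi -pm 1` (root required); whether that changes the behaviour reported here is exactly the open question.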
Hey, I have the same problem.
2. Steps to reproduce the issue
docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
root@098b49afe624:/# nvidia-smi
Fri Sep 2 21:54:31 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02 Driver Version: 510.68.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
This works until systemctl daemon-reload is run, either manually or automatically by the OS (I assume the latter happens, since the container eventually fails on its own).
(on host):
systemctl daemon-reload
(inside same running container):
root@098b49afe624:/# nvidia-smi
Failed to initialize NVML: Unknown Error
Running the container again will work fine until you do another systemctl daemon-reload.
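The reproduction steps above can be collected into one script. This is a sketch under assumptions rather than a verified test case: `gpu_state` and `repro_daemon_reload` are hypothetical names, the container is kept alive with `sleep infinity` instead of an interactive shell, and the check relies on nvidia-smi exiting non-zero when NVML cannot be initialized. It needs a GPU host with the NVIDIA container runtime and root privileges for systemctl.

```shell
# Sketch of the reproduction above. Nothing runs until
# repro_daemon_reload is called explicitly.

IMAGE=${IMAGE:-nvidia/cuda:11.4.2-base-ubuntu18.04}  # image from the report

# Prints "ok" if nvidia-smi succeeds inside container $1, "failed" otherwise.
gpu_state() {
    if docker exec "$1" nvidia-smi >/dev/null 2>&1; then
        echo ok
    else
        echo failed
    fi
}

# Starts a long-lived GPU container, checks the GPUs, reloads systemd on the
# host, checks again, and prints "before=<state> after=<state>".
repro_daemon_reload() {
    repro_cid=$(docker run -d --rm --gpus all "$IMAGE" sleep infinity) || return 1
    repro_before=$(gpu_state "$repro_cid")
    systemctl daemon-reload || return 1
    repro_after=$(gpu_state "$repro_cid")
    docker rm -f "$repro_cid" >/dev/null 2>&1 || true
    echo "before=$repro_before after=$repro_after"
}
```

If the behaviour matches the report, calling `repro_daemon_reload` as root would print `before=ok after=failed`; `before=ok after=ok` would suggest the host is not affected by this particular trigger.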
3. Information to attach (optional if deemed irrelevant)
- Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
I0902 21:40:53.603015 2836338 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0902 21:40:53.603083 2836338 nvc.c:350] using root /
I0902 21:40:53.603093 2836338 nvc.c:351] using ldcache /etc/ld.so.cache
I0902 21:40:53.603100 2836338 nvc.c:352] using unprivileged user 1000:1000
I0902 21:40:53.603133 2836338 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0902 21:40:53.603287 2836338 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0902 21:40:53.607634 2836339 nvc.c:273] failed to set inheritable capabilities
W0902 21:40:53.607692 2836339 nvc.c:274] skipping kernel modules load due to failure
I0902 21:40:53.608141 2836340 rpc.c:71] starting driver rpc service
I0902 21:40:53.620107 2836341 rpc.c:71] starting nvcgo rpc service
I0902 21:40:53.621514 2836338 nvc_info.c:766] requesting driver information with ''
I0902 21:40:53.623204 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 21:40:53.623384 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 21:40:53.623470 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02
I0902 21:40:53.623534 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 21:40:53.623599 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 21:40:53.623686 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 21:40:53.623774 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 21:40:53.623838 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 21:40:53.623900 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 21:40:53.623987 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 21:40:53.624046 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02
I0902 21:40:53.624105 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02
I0902 21:40:53.624167 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02
I0902 21:40:53.624270 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02
I0902 21:40:53.624362 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02
I0902 21:40:53.624430 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 21:40:53.624507 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 21:40:53.624590 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 21:40:53.624684 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02
I0902 21:40:53.624959 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 21:40:53.625088 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02
I0902 21:40:53.625151 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02
I0902 21:40:53.625213 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02
I0902 21:40:53.625277 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02
W0902 21:40:53.625310 2836338 nvc_info.c:399] missing library libnvidia-nscq.so
W0902 21:40:53.625322 2836338 nvc_info.c:399] missing library libcudadebugger.so
W0902 21:40:53.625330 2836338 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 21:40:53.625340 2836338 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0902 21:40:53.625349 2836338 nvc_info.c:399] missing library libnvidia-ifr.so
W0902 21:40:53.625359 2836338 nvc_info.c:399] missing library libnvidia-cbl.so
W0902 21:40:53.625368 2836338 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0902 21:40:53.625376 2836338 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0902 21:40:53.625386 2836338 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0902 21:40:53.625394 2836338 nvc_info.c:403] missing compat32 library libcuda.so
W0902 21:40:53.625404 2836338 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0902 21:40:53.625413 2836338 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0902 21:40:53.625422 2836338 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W0902 21:40:53.625432 2836338 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0902 21:40:53.625441 2836338 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0902 21:40:53.625450 2836338 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W0902 21:40:53.625459 2836338 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0902 21:40:53.625468 2836338 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0902 21:40:53.625477 2836338 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 21:40:53.625486 2836338 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 21:40:53.625495 2836338 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 21:40:53.625505 2836338 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 21:40:53.625514 2836338 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 21:40:53.625523 2836338 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0902 21:40:53.625532 2836338 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0902 21:40:53.625541 2836338 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0902 21:40:53.625551 2836338 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0902 21:40:53.625561 2836338 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 21:40:53.625570 2836338 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0902 21:40:53.625579 2836338 nvc_info.c:403] missing compat32 library libnvoptix.so
W0902 21:40:53.625588 2836338 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0902 21:40:53.625598 2836338 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 21:40:53.625607 2836338 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 21:40:53.625616 2836338 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0902 21:40:53.625625 2836338 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0902 21:40:53.625631 2836338 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 21:40:53.626022 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0902 21:40:53.626055 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 21:40:53.626088 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 21:40:53.626139 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 21:40:53.626172 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0902 21:40:53.626281 2836338 nvc_info.c:425] missing binary nv-fabricmanager
I0902 21:40:53.626333 2836338 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 21:40:53.626375 2836338 nvc_info.c:529] listing device /dev/nvidiactl
I0902 21:40:53.626385 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm
I0902 21:40:53.626395 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0902 21:40:53.626404 2836338 nvc_info.c:529] listing device /dev/nvidia-modeset
W0902 21:40:53.626447 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W0902 21:40:53.626483 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0902 21:40:53.626510 2836338 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0902 21:40:53.626521 2836338 nvc_info.c:822] requesting device information with ''
I0902 21:40:53.633742 2836338 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)
I0902 21:40:53.640730 2836338 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)
I0902 21:40:53.647954 2836338 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)
I0902 21:40:53.655371 2836338 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)
I0902 21:40:53.663009 2836338 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)
I0902 21:40:53.670891 2836338 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)
I0902 21:40:53.679015 2836338 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)
I0902 21:40:53.687078 2836338 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)
NVRM version: 510.68.02
CUDA version: 11.6
Device Index: 0
Device Minor: 0
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-9c416c82-d801-d28f-0867-dd438d4be914
Bus Location: 00000000:04:00.0
Architecture: 6.1
Device Index: 1
Device Minor: 1
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a
Bus Location: 00000000:05:00.0
Architecture: 6.1
Device Index: 2
Device Minor: 2
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe
Bus Location: 00000000:08:00.0
Architecture: 6.1
Device Index: 3
Device Minor: 3
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-1ab2485c-121c-77db-6719-0b616d1673f4
Bus Location: 00000000:09:00.0
Architecture: 6.1
Device Index: 4
Device Minor: 4
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c
Bus Location: 00000000:0b:00.0
Architecture: 6.1
Device Index: 5
Device Minor: 5
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-c16444fb-bedb-106d-c188-1f330773cf39
Bus Location: 00000000:84:00.0
Architecture: 6.1
Device Index: 6
Device Minor: 6
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0
Bus Location: 00000000:85:00.0
Architecture: 6.1
Device Index: 7
Device Minor: 7
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28
Bus Location: 00000000:89:00.0
Architecture: 6.1
I0902 21:40:53.687293 2836338 nvc.c:434] shutting down library context
I0902 21:40:53.687347 2836341 rpc.c:95] terminating nvcgo rpc service
I0902 21:40:53.687881 2836338 rpc.c:135] nvcgo rpc service terminated successfully
I0902 21:40:53.692819 2836340 rpc.c:95] terminating driver rpc service
I0902 21:40:53.693046 2836338 rpc.c:135] driver rpc service terminated successfully
- Kernel version from
uname -a
Linux node5-4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Any relevant kernel output lines from
dmesg
Nothing relevant in dmesg; the only relevant line in journalctl is
Sep 02 21:17:56 node5-4 systemd[1]: Reloading.
which appears once I run systemctl daemon-reload
- Driver information from
nvidia-smi -a
Fri Sep 2 21:22:32 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02    Driver Version: 510.68.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN X ...  On   | 00000000:04:00.0 Off |                  N/A |
| 23%   23C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN X ...  On   | 00000000:05:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN X ...  On   | 00000000:08:00.0 Off |                  N/A |
| 23%   22C    P8     7W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN X ...  On   | 00000000:09:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA TITAN X ...  On   | 00000000:0B:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA TITAN X ...  On   | 00000000:84:00.0 Off |                  N/A |
| 23%   25C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA TITAN X ...  On   | 00000000:85:00.0 Off |                  N/A |
| 23%   22C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA TITAN X ...  On   | 00000000:89:00.0 Off |                  N/A |
| 23%   23C    P8     7W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- Docker version from
docker version
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.4
GitCommit: 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
runc:
Version: 1.1.1
GitCommit: v1.1.1-0-g52de29d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- NVIDIA packages version from
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.10.0-1 all NVIDIA container runtime
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
- NVIDIA container library version from
nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- NVIDIA container library logs (see troubleshooting)
I0902 22:11:39.880399 2840718 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0902 22:11:39.880483 2840718 nvc.c:350] using root /
I0902 22:11:39.880501 2840718 nvc.c:351] using ldcache /etc/ld.so.cache
I0902 22:11:39.880514 2840718 nvc.c:352] using unprivileged user 65534:65534
I0902 22:11:39.880559 2840718 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0902 22:11:39.880751 2840718 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0902 22:11:39.884769 2840724 nvc.c:278] loading kernel module nvidia
I0902 22:11:39.884931 2840724 nvc.c:282] running mknod for /dev/nvidiactl
I0902 22:11:39.884991 2840724 nvc.c:286] running mknod for /dev/nvidia0
I0902 22:11:39.885033 2840724 nvc.c:286] running mknod for /dev/nvidia1
I0902 22:11:39.885071 2840724 nvc.c:286] running mknod for /dev/nvidia2
I0902 22:11:39.885109 2840724 nvc.c:286] running mknod for /dev/nvidia3
I0902 22:11:39.885147 2840724 nvc.c:286] running mknod for /dev/nvidia4
I0902 22:11:39.885185 2840724 nvc.c:286] running mknod for /dev/nvidia5
I0902 22:11:39.885222 2840724 nvc.c:286] running mknod for /dev/nvidia6
I0902 22:11:39.885260 2840724 nvc.c:286] running mknod for /dev/nvidia7
I0902 22:11:39.885298 2840724 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0902 22:11:39.892775 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0902 22:11:39.892935 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0902 22:11:39.899624 2840724 nvc.c:296] loading kernel module nvidia_uvm
I0902 22:11:39.899673 2840724 nvc.c:300] running mknod for /dev/nvidia-uvm
I0902 22:11:39.899778 2840724 nvc.c:305] loading kernel module nvidia_modeset
I0902 22:11:39.899820 2840724 nvc.c:309] running mknod for /dev/nvidia-modeset
I0902 22:11:39.900186 2840725 rpc.c:71] starting driver rpc service
I0902 22:11:39.911718 2840726 rpc.c:71] starting nvcgo rpc service
I0902 22:11:39.912892 2840718 nvc_container.c:240] configuring container with 'compute utility supervised'
I0902 22:11:39.913283 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06
I0902 22:11:39.913368 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.915116 2840718 nvc_container.c:262] setting pid to 2840712
I0902 22:11:39.915147 2840718 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:39.915160 2840718 nvc_container.c:264] setting owner to 0:0
I0902 22:11:39.915171 2840718 nvc_container.c:265] setting bins directory to /usr/bin
I0902 22:11:39.915182 2840718 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0902 22:11:39.915193 2840718 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0902 22:11:39.915204 2840718 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0902 22:11:39.915215 2840718 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0902 22:11:39.915228 2840718 nvc_container.c:270] setting mount namespace to /proc/2840712/ns/mnt
I0902 22:11:39.915240 2840718 nvc_container.c:272] detected cgroupv2
I0902 22:11:39.915271 2840718 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-5fff6f80850791d3858cb511015581375d55ae42df5eb98262ceae31ed47a7d5.scope
I0902 22:11:39.915292 2840718 nvc_info.c:766] requesting driver information with ''
I0902 22:11:39.916901 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 22:11:39.917076 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 22:11:39.917165 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02
I0902 22:11:39.917236 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 22:11:39.917318 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 22:11:39.917411 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 22:11:39.917503 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.917574 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 22:11:39.917639 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 22:11:39.917730 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 22:11:39.917794 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02
I0902 22:11:39.917859 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02
I0902 22:11:39.917926 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02
I0902 22:11:39.918018 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02
I0902 22:11:39.918109 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02
I0902 22:11:39.918176 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 22:11:39.918243 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 22:11:39.918335 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 22:11:39.918429 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02
I0902 22:11:39.918628 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 22:11:39.918758 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02
I0902 22:11:39.918827 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02
I0902 22:11:39.918896 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02
I0902 22:11:39.918968 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02
W0902 22:11:39.919005 2840718 nvc_info.c:399] missing library libnvidia-nscq.so
W0902 22:11:39.919022 2840718 nvc_info.c:399] missing library libcudadebugger.so
W0902 22:11:39.919035 2840718 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 22:11:39.919049 2840718 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0902 22:11:39.919061 2840718 nvc_info.c:399] missing library libnvidia-ifr.so
W0902 22:11:39.919074 2840718 nvc_info.c:399] missing library libnvidia-cbl.so
W0902 22:11:39.919088 2840718 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0902 22:11:39.919107 2840718 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0902 22:11:39.919119 2840718 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0902 22:11:39.919131 2840718 nvc_info.c:403] missing compat32 library libcuda.so
W0902 22:11:39.919144 2840718 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0902 22:11:39.919156 2840718 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0902 22:11:39.919168 2840718 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W0902 22:11:39.919192 2840718 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0902 22:11:39.919206 2840718 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0902 22:11:39.919218 2840718 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W0902 22:11:39.919230 2840718 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0902 22:11:39.919242 2840718 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0902 22:11:39.919254 2840718 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 22:11:39.919266 2840718 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 22:11:39.919279 2840718 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 22:11:39.919291 2840718 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 22:11:39.919304 2840718 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 22:11:39.919317 2840718 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0902 22:11:39.919329 2840718 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0902 22:11:39.919341 2840718 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0902 22:11:39.919353 2840718 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0902 22:11:39.919365 2840718 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 22:11:39.919377 2840718 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0902 22:11:39.919388 2840718 nvc_info.c:403] missing compat32 library libnvoptix.so
W0902 22:11:39.919401 2840718 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0902 22:11:39.919413 2840718 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 22:11:39.919426 2840718 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 22:11:39.919438 2840718 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0902 22:11:39.919451 2840718 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0902 22:11:39.919463 2840718 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 22:11:39.919856 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0902 22:11:39.919895 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 22:11:39.919931 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 22:11:39.919985 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 22:11:39.920022 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0902 22:11:39.920096 2840718 nvc_info.c:425] missing binary nv-fabricmanager
I0902 22:11:39.920152 2840718 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 22:11:39.920200 2840718 nvc_info.c:529] listing device /dev/nvidiactl
I0902 22:11:39.920215 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm
I0902 22:11:39.920228 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0902 22:11:39.920240 2840718 nvc_info.c:529] listing device /dev/nvidia-modeset
W0902 22:11:39.920281 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W0902 22:11:39.920324 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0902 22:11:39.920355 2840718 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0902 22:11:39.920371 2840718 nvc_info.c:822] requesting device information with ''
I0902 22:11:39.927586 2840718 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)
I0902 22:11:39.934626 2840718 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)
I0902 22:11:39.941796 2840718 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)
I0902 22:11:39.949011 2840718 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)
I0902 22:11:39.956304 2840718 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)
I0902 22:11:39.963862 2840718 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)
I0902 22:11:39.971543 2840718 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)
I0902 22:11:39.979406 2840718 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)
I0902 22:11:39.979522 2840718 nvc_mount.c:366] mounting tmpfs at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia
I0902 22:11:39.980084 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-smi
I0902 22:11:39.980181 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-debugdump
I0902 22:11:39.980273 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-persistenced
I0902 22:11:39.980360 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-control
I0902 22:11:39.980443 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-server
I0902 22:11:39.980696 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 22:11:39.980795 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 22:11:39.980919 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 22:11:39.981004 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.981090 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 22:11:39.981182 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 22:11:39.981272 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 22:11:39.981314 2840718 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0902 22:11:39.981482 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.470.129.06
I0902 22:11:39.981569 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.981887 2840718 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/510.68.02/gsp.bin at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/lib/firmware/nvidia/510.68.02/gsp.bin with flags 0x7
I0902 22:11:39.981971 2840718 nvc_mount.c:230] mounting /dev/nvidiactl at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidiactl
I0902 22:11:39.982876 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm
I0902 22:11:39.983470 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm-tools
I0902 22:11:39.983976 2840718 nvc_mount.c:230] mounting /dev/nvidia0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia0
I0902 22:11:39.984099 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:04:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:04:00.0
I0902 22:11:39.984695 2840718 nvc_mount.c:230] mounting /dev/nvidia1 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia1
I0902 22:11:39.984812 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:05:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:05:00.0
I0902 22:11:39.985425 2840718 nvc_mount.c:230] mounting /dev/nvidia2 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia2
I0902 22:11:39.985541 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:08:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:08:00.0
I0902 22:11:39.986207 2840718 nvc_mount.c:230] mounting /dev/nvidia3 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia3
I0902 22:11:39.986322 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:09:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:09:00.0
I0902 22:11:39.986963 2840718 nvc_mount.c:230] mounting /dev/nvidia4 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia4
I0902 22:11:39.987076 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:0b:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:0b:00.0
I0902 22:11:39.987794 2840718 nvc_mount.c:230] mounting /dev/nvidia5 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia5
I0902 22:11:39.987907 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:84:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:84:00.0
I0902 22:11:39.988593 2840718 nvc_mount.c:230] mounting /dev/nvidia6 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia6
I0902 22:11:39.988707 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:85:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:85:00.0
I0902 22:11:39.989388 2840718 nvc_mount.c:230] mounting /dev/nvidia7 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia7
I0902 22:11:39.989515 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:89:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:89:00.0
I0902 22:11:39.990197 2840718 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:40.012422 2840718 nvc.c:434] shutting down library context
I0902 22:11:40.012510 2840726 rpc.c:95] terminating nvcgo rpc service
I0902 22:11:40.013110 2840718 rpc.c:135] nvcgo rpc service terminated successfully
I0902 22:11:40.018693 2840725 rpc.c:95] terminating driver rpc service
I0902 22:11:40.018995 2840718 rpc.c:135] driver rpc service terminated successfully
- Docker command, image and tag used
docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
nvidia-smi
Other open issues
NVIDIA/nvidia-container-toolkit#251, but that one is using cgroup v1
#1661, which has no information posted and is on Ubuntu 20.04 instead of 22.04
Important notes / workaround
With containerd.io v1.6.7 or v1.6.8, even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and the devices specified to docker run, I still get Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.
Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run, like
docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
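On a multi-GPU host the explicit --device flags in this workaround get long, so they can be generated instead of typed by hand. A minimal Python sketch, assuming the GPU nodes are named /dev/nvidia0 through /dev/nvidia(N-1) and that the control nodes are the four named in the command above. The function name build_docker_run_args is made up for illustration and is not part of Docker or the NVIDIA toolkit.

```python
import shlex

# Control device nodes named in the workaround command above.
CONTROL_DEVICES = [
    "/dev/nvidiactl",
    "/dev/nvidia-modeset",
    "/dev/nvidia-uvm",
    "/dev/nvidia-uvm-tools",
]


def build_docker_run_args(gpu_count, image, command="bash"):
    """Return a `docker run` argument list that maps every NVIDIA device node explicitly."""
    # Assumption: GPU nodes are numbered contiguously from 0.
    devices = [f"/dev/nvidia{i}" for i in range(gpu_count)] + CONTROL_DEVICES
    args = ["docker", "run", "--gpus", "all"]
    for dev in devices:
        # Map each host node to the same path inside the container.
        args += ["--device", f"{dev}:{dev}"]
    args += ["--rm", "-it", image, command]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command for an 8-GPU host like the one in this report.
    print(shlex.join(build_docker_run_args(8, "nvidia/cuda:11.4.2-base-ubuntu18.04")))
```

Generating the list also avoids typos in the host:container pairs, which are easy to miss in a long command line.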
@elezar Previously persistence mode was off, so this happens either way.
Also, on k8s-device-plugin/issues/289 @klueska said:
The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath runC.
Was that merged, or is it something I should try?
@kevin-bockman the experimental mode is still a work in progress and we don't have a concrete timeline on when this will be available for testing. I will update the issue here as soon as I have more information.
The other option is to move to cgroupv2. Since devices are not an actual subsystem in cgroupv2, there is no chance for containerd to undo what libnvidia-container has done under the hood after a restart.
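Whether a host is really on the unified (v2) hierarchy can be checked directly. A rough Python sketch, assuming cgroups are mounted at the usual /sys/fs/cgroup: on cgroup v2 the root contains a cgroup.controllers file, and devices should not appear in it, since device access is enforced through eBPF programs rather than a controller. The helper name is hypothetical.

```python
from pathlib import Path


def cgroup_v2_controllers(root="/sys/fs/cgroup"):
    """Return the controllers listed at the cgroup v2 root, or None if not on v2.

    Assumes cgroup v2 would be mounted at `root`; a v1 or hybrid layout has no
    cgroup.controllers file at this level.
    """
    controllers_file = Path(root) / "cgroup.controllers"
    if not controllers_file.is_file():
        return None
    return controllers_file.read_text().split()


if __name__ == "__main__":
    controllers = cgroup_v2_controllers()
    if controllers is None:
        print("cgroup v2 not detected at /sys/fs/cgroup")
    else:
        print("cgroup v2 controllers:", " ".join(controllers))
        print("devices listed as a controller:", "devices" in controllers)
```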
@klueska Sorry, with all of the information, it wasn't really clear. The problem is that it's already on cgroupv2, AFAIK. I started from a fresh install of Ubuntu 22.04.1, and docker info says it is, at least.
The only way I could get this to work after a systemctl daemon-reload is downgrading containerd.io to 1.6.6 and specifying no-cgroups. The other interesting thing is that with containerd v1.6.7 or v1.6.8, even specifying no-cgroups still had the issue, so I'm wondering if there's more than one issue here. I know cgroup v2 has 'fixed' the issue for some people, or so they think (this can look like an intermittent issue if you don't know that the reload triggers it), but it doesn't seem to have fixed it for everyone, unless I'm missing something: it doesn't work on a fresh install after doing a daemon reload, or after just waiting for something to be triggered by the OS.
$ docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Docker Buildx (Docker Inc., v0.8.2-docker)
Server:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 4
Server Version: 20.10.17
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: default
cgroupns
Kernel Version: 5.15.0-46-generic
Operating System: Ubuntu 22.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 94.36GiB
Name: node5-4
ID: PPB6:APYD:PKMA:BIOZ:2Y3H:LZUV:TPHD:SBZE:XRSL:NJCB:PWMX:ZVBY
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
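The two fields that matter in the docker info output above are Cgroup Driver and Cgroup Version. A small Python sketch for pulling them out of the plain-text output, assuming the "Key: value" layout shown above; parse_docker_info is an illustrative name, not a Docker API.

```python
def parse_docker_info(text):
    """Collect `Key: value` pairs from plain-text `docker info` output.

    Later duplicates (e.g. the two `Debug Mode` lines) overwrite earlier ones;
    lines without a value, such as section headers, are skipped.
    """
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so values like URLs stay intact.
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Excerpt of the output above, used as sample input.
sample = """\
Server Version: 20.10.17
Cgroup Driver: systemd
Cgroup Version: 2
Default Runtime: runc
"""
info = parse_docker_info(sample)
print(info["Cgroup Driver"], info["Cgroup Version"])  # prints: systemd 2
```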
@kevin-bockman I had a similar experience.
In my case,
docker run -it --device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
--device /dev/nvidia2:/dev/nvidia2 \
--device /dev/nvidia3:/dev/nvidia3 \
--name <container_name> <image_name>
(Replace/repeat nvidia0 with other/more devices as needed.)
This setup works on some machines and not on others.
Eventually I found that the working machines have containerd.io version 1.4.6-1 (Ubuntu 18.04)!
On an Ubuntu 20.04 machine, containerd.io version 1.5.2-1 makes it work.
I tried downgrading and upgrading containerd.io to check whether this strategy works.
It works for me.
The above is not the answer after all...
It prevents the NVML error caused by a docker resource update, but the NVML error still occurs after a random amount of time.
Same issue. Ubuntu 22, Docker CE. I will just end up writing a cron job script to check for the error and restart the container.
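A minimal sketch of such a watchdog, assuming the error text is exactly the one reported in this issue. The function names and the idea of passing the container name as an argument are illustrative, not something posted in the thread; only `docker exec` and `docker restart` are real commands here.

```shell
# Returns 0 if the text on stdin contains the NVML initialization failure.
nvml_failed() {
    grep -q 'Failed to initialize NVML'
}

# Hypothetical helper: restart container "$1" if nvidia-smi inside it
# reports the NVML error. The container name is supplied by the caller.
restart_if_gpu_lost() {
    if docker exec "$1" nvidia-smi 2>&1 | nvml_failed; then
        docker restart "$1"
    fi
}
```

Calling restart_if_gpu_lost with a container name from a cron job every few minutes would only paper over the problem; it does not stop the devices from being dropped.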
The solution proposed by @kevin-bockman has been working without any problem for more than a month now.
Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run, like:
docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
I am using docker-ce on Ubuntu 22, so I opted for this approach, working fine so far.
same issue on Nvidia 3090
Ubuntu 22.04.1 LTS, Driver Version: 510.85.02 CUDA Version: 11.6
Hello there.
I'm hitting the same issue here, but with containerd rather than docker.
Here's my configuration:
- GPUs:
# lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
- OS:
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
- containerd release:
# containerd --version
containerd containerd.io 1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
- nvidia-container-toolkit version:
# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.11.0
commit: d9de4a0
- runc version:
# runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d
spec: 1.0.2-dev
go: go1.17.13
libseccomp: 2.5.1
Note that NVIDIA's container toolkit was installed with NVIDIA's GPU operator on Kubernetes (v1.25.3).
I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment.
containerd.txt
nvidia-container-runtime.txt
How I reproduce this bug:
Running on my host the following command:
# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash
After some time, the nvidia-smi command exits with the error Failed to initialize NVML: Unknown Error.
Traces, logs, etc...
- Here are the devices listed in the state.json file:
{ "type": 99, "major": 195, "minor": 255, "permissions": "", "allow": false, "path": "/dev/nvidiactl", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 234, "minor": 0, "permissions": "", "allow": false, "path": "/dev/nvidia-uvm", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 234, "minor": 1, "permissions": "", "allow": false, "path": "/dev/nvidia-uvm-tools", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 195, "minor": 254, "permissions": "", "allow": false, "path": "/dev/nvidia-modeset", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 195, "minor": 0, "permissions": "", "allow": false, "path": "/dev/nvidia0", "file_mode": 438, "uid": 0, "gid": 0 }
Thank you very much for your help.
I wrote up the detailed steps for how I fixed this issue in our env with cgroup v2 here. Let me know if it works in your env.
https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc
@gengwg Can you check whether your solution still works after calling sudo systemctl daemon-reload on the host? In my case (cgroupv1), it directly breaks the pod: from the pod, nvidia-smi returns Failed to initialize NVML: Unknown Error.
Yes. That's actually the first thing I tested when I upgraded v1 --> v2. It's easy to test, because you don't need to wait a few hours/days.
To double check, I just tested it again right now.
Before:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)
Do the reload on that node itself:
# systemctl daemon-reload
After:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)
I will update the note to reflect this test too.
And I can also confirm that's what I saw on our cgroupv1 nodes too, i.e. sudo systemctl daemon-reload immediately breaks nvidia-smi.
Hi, what's your cgroup driver for kubelet and containerd? We hit the same problem on cgroup v2 with the systemd cgroup driver, but if we switch the cgroup driver to cgroupfs, the problem disappears. I think the systemd cgroup driver is what causes the problem.
Also, switching docker's cgroup driver to cgroupfs solves the problem as well.
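To see which driver a Docker host is using, the "Cgroup Driver" line of docker info can be extracted. This is a small sketch: the function name is made up for illustration, and it only parses text in the format shown in the docker info output earlier in this thread.

```shell
# Reads `docker info` text on stdin and prints the value of the
# "Cgroup Driver" line (for example "systemd" or "cgroupfs").
cgroup_driver() {
    sed -n 's/^[[:space:]]*Cgroup Driver:[[:space:]]*//p'
}
```

Typical use would be docker info 2>/dev/null | cgroup_driver; docker info --format '{{.CgroupDriver}}' should give the same value without parsing.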
Important notes / workaround
containerd.io v1.6.7 or v1.6.8, even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specifying the devices to docker run, gives Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.
Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run, like:
docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
I've also tried this. The reason containerd 1.6.7 doesn't work is that its runc was updated to 1.1.3; in this version, runc ignores char devices that can't be os.Stat'ed (see this PR). Unfortunately, the GPU-related devices are exactly that kind of device, so it goes wrong.
@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.
I deployed two environments to help me make some comparisons:
- One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
- One environment with only containerd & nvidia-container-toolkit
Interestingly, I never hit this issue on the second environment; everything runs perfectly well.
The first environment, though, runs into this issue after some time.
That would probably mean that Nvidia's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure that I'm not missing anything.
I'll have a look at the cgroup driver as @panli889 mentioned.
Thanks again for your help
The cgroup drivers for kubelet, docker and containerd are all systemd. In fact, on cgroupv1 we used to use cgroupfs, but kubelet wouldn't start, complaining about a mismatch between the kubelet and docker cgroup drivers. After I changed the docker (and containerd) cgroup driver to systemd, kubelet was able to start.
# cat /etc/systemd/system/kubelet.service | grep -i cgroup
--runtime-cgroups=/systemd/system.slice \
--kubelet-cgroups=/systemd/system.slice \
--cgroup-driver=systemd \
We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH.
Docker nodes:
# docker info | grep -i cgroup
WARNING: No swap limit support
Cgroup Driver: systemd
Cgroup Version: 2
cgroupns
Containerd nodes:
$ sudo crictl info | grep -i cgroup
"SystemdCgroup": true
"SystemdCgroup": true
"systemdCgroup": false,
"disableCgroup": false,
Here is our k8s version:
$ k version --short
Client Version: v1.21.3
Server Version: v1.22.9
I think ours is similar to your 2nd env, i.e. containerd & nvidia-container-toolkit. we are on k8s v1.22.9.
# containerd --version
containerd containerd.io 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
# dnf info nvidia-container-toolkit | grep Version
Version : 1.11.0
i posted cgroup driver info above.
@gengwg thx for your reply!
cgroup driver for kubelet, docker and containerd are all systemd.
Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?
I can share the problem we hit: when we create a pod with a GPU, a related systemd scope such as cri-containerd-xxxxxx.scope is created at the same time, and it records the cgroup info. If we run systemctl status to check it:
Warning: The unit file, source configuration file or drop-ins of cri-containerd-xxxxx.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
● cri-containerd-xxx.scope - libcontainer container xxxx
Loaded: loaded (/run/systemd/transient/cri-containerd-xxxx.scope; transient)
Transient: yes
Drop-In: /run/systemd/transient/cri-containerd-xxxxx.scope.d
└─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUWeight.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf, 50-AllowedCPUs.conf
Active: active (running) since Fri 2022-11-25 12:13:33 +08; 1min 47s ago
IO: 404.0K read, 0B written
Tasks: 1
Memory: 528.0K
CPU: 2.562s
CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb6b36d39_ef5b_4eb9_850d_d710bbd06096.slice/cri-containerd-xxx.scope>
└─61265 sleep infinity
And if we check the content of the file 50-DeviceAllow.conf, we find no GPU device info in there. Then if we run systemctl daemon-reload, it regenerates an eBPF cgroup program for the devices, and it blocks access to the GPU devices.
So would you please also take a look at the content of DeviceAllow.conf for the systemd scope of some pod? What's in there?
Same issue with 2 x Nvidia 3090 Ti, Ubuntu 22.04.1 LTS, Driver Version: 510.85.02, CUDA Version: 11.6
I adopted the solution proposed by @kevin-bockman, downgrading containerd.io from 1.6.10 to 1.6.6. After running systemctl daemon-reload on the host machine, nvidia-smi within the container still works properly. I will check how long it lasts and I'll keep you updated.
@panli889 I checked the scope unit with systemctl status, and this message popped up:
Warning: The unit file, source configuration file or drop-ins of cri-containerd-d35333ac42f1e08a33632fccd63028a28443f95f3c126860a8c9da20b6d27102.scope changed on disk. Run 'systemctl daemon-reload' to reload units.
After running systemctl daemon-reload, I get the error on my container:
root@ubuntu:/# nvidia-smi
Failed to initialize NVML: Unknown Error
Here's the content of the 50-DeviceAllow.conf file:
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
There's indeed no reference to nvidia's devices here:
crw-rw-rw- 1 root root 195, 254 Nov 29 10:18 nvidia-modeset
crw-rw-rw- 1 root root 234, 0 Nov 29 10:18 nvidia-uvm
crw-rw-rw- 1 root root 234, 1 Nov 29 10:18 nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Nov 29 10:18 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 29 10:18 nvidiactl
nvidia-caps:
total 0
cr-------- 1 root root 237, 1 Nov 29 10:18 nvidia-cap1
cr--r--r-- 1 root root 237, 2 Nov 29 10:18 nvidia-cap2
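The comparison above (device numbers from ls against the DeviceAllow entries) can be scripted. This is only a sketch: device_allowed is a made-up helper name, and it assumes the drop-in uses the /dev/char/MAJOR:MINOR form seen in the file above.

```shell
# Prints "yes" if drop-in file "$1" has an explicit DeviceAllow entry for
# the char device with major "$2" and minor "$3", otherwise "no".
# Wildcard entries such as "char-* m" are deliberately not counted,
# since they only grant mknod.
device_allowed() {
    if grep -q "^DeviceAllow=/dev/char/$2:$3 " "$1"; then
        echo yes
    else
        echo no
    fi
}
```

For the file shown above, device_allowed 50-DeviceAllow.conf 195 255 (nvidiactl) would print "no", which matches the failure after daemon-reload.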
@fradsj thanks for your reply, it seems to be the same problem we had.
Here is how we solved it, hope it helps:
- Add --pass-device-specs=true to your k8s-device-plugin, as this comment said: #966 (comment). This param ensures GPU devices are returned by the device plugin instead of just setting the env when allocating; then 50-DeviceAllow.conf will include the GPU device info.
- Ensure the runc version is under 1.1.3. As I mentioned above, runc 1.1.3 introduced a PR that makes it ignore the GPU devices passed to runc in step one. opencontainers/runc#3671
Hi,
Any official way to fix this error ?
The official way is in the works.
It is based on using a new specification called CDI to do the GPU device injection, rather than relying on a runc hook to do the GPU device injection behind the back of containerd (which is a fundamental / architectural flaw of the existing nvidia-container-runtime, and is the underlying cause of all these problems).
Until a version of both (1) the nvidia-container-runtime and (2) the k8s-device-plugin are released with proper support for CDI, you will need to rely on one of the workarounds described here.
There is no "official" workaround as such, but the workaround described in #1671 (comment) seems like the best one from my perspective. It relies on the already documented use of --pass-device-specs=true in the k8s-device-plugin (which has been the workaround for years until now) combined with downgrading to a version of runc which doesn't trigger the GPUs to be ignored.
@panli889 sorry for late reply. was on vacation.
systemd version:
$ systemctl --version
systemd 239 (239-58.el8)
After spinning up a pod on a node:
$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-3836675c-e987-1f01-7ce7-12da20038909)
I don't see the systemd scope nor the DeviceAllow files.
$ find /etc/systemd/ | grep scope
$ sudo find /etc/ | grep -i DeviceAllow
Checked those on our env.
Here is how we solve it, hope it will help:
- Add --pass-device-specs=true to your k8s-device-plugin like this comment said Updating cpu-manager-policy=static causes NVML unknown error #966 (comment) . This param will ensure GPU devices are returned by the device plugin instead of just setting the env when allocating, then the 50-DeviceAllow.conf will include GPU device info.
We didn't use the --pass-device-specs=true option, but we do have allowPrivilegeEscalation: false. It doesn't look like the same thing.
$ k get ds nvidia-device-plugin-daemonset -n kube-system -o yaml
....
spec:
containers:
- args:
- --fail-on-init-error=false
image: xxxxx.com/k8s-device-plugin:v0.9.0
imagePullPolicy: IfNotPresent
name: nvidia-device-plugin-ctr
resources: {}
securityContext:
allowPrivilegeEscalation: false # <------
capabilities:
drop:
- ALL
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/kubelet/device-plugins
name: device-plugin
dnsPolicy: ClusterFirst
....
- Ensure the runc version is under 1.1.3, as I mentioned above, runc 1.1.3 introduced an PR, it will ignore the GPU devices passed to runc in step one. Nvidia GPU devices in systemd will be ignored after 1.1.3 opencontainers/runc#3671
Luckily we are right below 1.1.3. We pinned the version on the repo side through centos composes, so this should be safe if we do not advance the compose version.
$ runc --version
runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.2
@panli889 running the device plugin with runc v1.1.2 seems to fix the situation, as the GPUs are listed in the DeviceAllow file of the container's cgroup:
[Scope]
DeviceAllow=
DeviceAllow=/dev/char/195:255 rw
DeviceAllow=/dev/char/195:0 rw
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
Thank you very much for your help!
@klueska it's surprising to see that Nvidia's GPUs are not listed in the /dev/char directory, which is where runc expects to find them. Do you know if that's expected by Nvidia's driver developers?
For CDI, do you know if the kubernetes community is working with you on this, and whether a release cycle has been decided yet?
Thank you very much.
I was able to reproduce this and verify that manually creating symlinks to the various nvidia devices in /dev/char resolves the issue. I need to talk to our driver team to determine why these are not automatically created and how to get them created going forward.
At least we seem to fully understand the problem now, and know what is necessary to resolve it. In the meantime, I would recommend creating these symlinks manually to work around this issue.
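One way to create those symlinks by hand might look like the sketch below. The function names and the choice of passing the source and target directories as arguments are assumptions made for illustration; it relies on GNU stat and ln, and it needs root when pointed at the real /dev and /dev/char.

```shell
# Converts the hex major/minor printed by `stat -c '%t %T'` into the
# decimal "MAJOR:MINOR" name used under /dev/char.
dev_char_name() {
    printf '%d:%d\n' "0x$1" "0x$2"
}

# Creates MAJOR:MINOR symlinks in directory "$2" (normally /dev/char) for
# every NVIDIA character device found under "$1" (normally /dev).
link_nvidia_devices() {
    src=$1
    dst=$2
    find "$src" -name 'nvidia*' -type c | while read -r dev; do
        # %t and %T are the major and minor device numbers in hex (GNU stat).
        name=$(dev_char_name $(stat -c '%t %T' "$dev"))
        # -f replaces an existing link so the function can be re-run.
        ln -sfn "$dev" "$dst/$name"
    done
}
```

For example, dev_char_name c3 ff prints 195:255, the number pair of /dev/nvidiactl listed earlier, and link_nvidia_devices /dev /dev/char would create the links. They do not survive a reboot, so this would have to run again after the driver modules are loaded.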
I'm having almost the same issue with a Quadro RTX 8000 cluster server.
I hope there is a quick solution before the official fix.
I have to keep restarting my docker container whenever I hit this issue.
GPU Operator seems to have had a release that contained a workaround. NVIDIA/gpu-operator#430 (comment)
Since I am not using GPU Operator, I have a small tool that does the same thing. I can confirm that this solves the problem in my environment. https://gist.github.com/superbrothers/5bbb80e15a7f3ad994f789165dce2938
A tool will be shipping with the next release of the nvidia container toolkit later today. I'll update here with instructions (or point at the official documentation if it's ready by then).
Hi @klueska, can you point me to the tool/instructions for resolving this issue? Thanks!
- Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).
- For deployments using the standalone k8s-device-plugin (i.e. not through the use of the operator), or for standalone docker, follow one of the workarounds listed below (in order of recommendation).
  - Using the nvidia-ctk utility:
    The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in /dev/char for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:
    sudo nvidia-ctk system create-dev-char-symlinks \
    --create-all
    This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
    A simple udev rule to enforce this can be seen below:
    # This will create /dev/char symlinks to all device nodes
    ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
    A good place to install this rule would be:
    /lib/udev/rules.d/71-nvidia-dev-char.rules
    In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:
    sudo nvidia-ctk system create-dev-char-symlinks \
    --create-all \
    --driver-root={{NVIDIA_DRIVER_ROOT}}
    Where {{NVIDIA_DRIVER_ROOT}} is the path to which the NVIDIA GPU Driver container installs the NVIDIA GPU driver and creates the NVIDIA Device Nodes.
  - Explicitly disabling systemd cgroup management in Docker:
    Set the parameter "exec-opts": ["native.cgroupdriver=cgroupfs"] in the /etc/docker/daemon.json file and restart docker.
  - Downgrading to docker.io packages where systemd is not the default cgroup manager (and not overriding that of course).
I'm going down the route of using option 1 - using nvidia-ctk, as I am running standalone Docker on Debian 11 (bullseye). I've added a udev rule but I haven't rebooted to see if it runs; I have manually executed nvidia-ctk system create-dev-char-symlinks --create-all and it created the symlinks. I'm using the driver packages directly from the Debian repos, not the GPU Driver container. If I run systemctl daemon-reload, it continues to trigger the same behavior as before, where I see Failed to initialize NVML: Unknown Error messages. I've re-created my GPU containers. Is there something I am missing? Does Docker need to be restarted, or is there a specific order in which things need to happen, beyond the kernel module being loaded before the symlinks are created?
Can you show me your docker command?
Note: this does not address the issue where you still need to explicitly pass the device nodes for /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl on the command line (that won't be fixed until CDI support is added to docker).
This fixes the issue where, even if you do explicitly pass the device nodes, you STILL lose access to the GPUs on a systemctl daemon-reload.
Sure thing:
docker run -d \
--restart unless-stopped \
--name nvidia-smi-rest \
--gpus 'all,"capabilities=utility"' \
--cpus 1 \
--memory 1g \
--memory-swap 1.5g \
mbentley/nvidia-smi-rest
/etc/docker/daemon.json:
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"storage-driver": "overlay2"
}
With systemd cgroup management you must always pass the nvidia device nodes on the docker command line (which you are not doing).
Meaning you would need to run:
docker run -d \
--restart unless-stopped \
--name nvidia-smi-rest \
--gpus 'all,"capabilities=utility"' \
--device /dev/nvidiactl \
--device /dev/nvidia0 \
...
--cpus 1 \
--memory 1g \
--memory-swap 1.5g \
mbentley/nvidia-smi-rest
This is due to the way GPU injection currently happens from within a runc hook when the --gpus flag is used. The hook manually sets up the cgroups for the NVIDIA devices behind the back of docker/containerd/runc -- so when a systemd daemon-reload happens the cgroup access for these devices gets undone (because these runtimes had no way of telling systemd that these devices had been injected by the hook and the reload triggers it to reevaluate all cgroup rules).
This issue only started to be noticed by most people recently because the latest release of docker flipped to using systemd cgroup management by default (as opposed to cgroupfs).
The good news is, once CDI support is added to docker, this won't be necessary anymore.
docker/cli#3864
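Until then, the list of --device flags can be generated rather than typed by hand. The helper below is a hypothetical sketch, not part of any NVIDIA tool: it takes the device directory as an argument and only checks that each node exists, not that it is a character device.

```shell
# Prints one "--device <path>" line for each NVIDIA device node that
# exists directly in directory "$1" (normally /dev).
nvidia_device_flags() {
    for dev in "$1"/nvidiactl "$1"/nvidia-uvm "$1"/nvidia-uvm-tools \
               "$1"/nvidia-modeset "$1"/nvidia[0-9]*; do
        # Skip names that do not exist, including an unmatched glob.
        if [ -e "$dev" ]; then
            printf '%s %s\n' --device "$dev"
        fi
    done
}
```

It could be used as docker run --gpus all $(nvidia_device_flags /dev) --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash, relying on the shell's word splitting to turn the output into separate arguments.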
The fix with the /dev/char symlink creation works fine, thanks.
But now we also need to set PASS_DEVICE_SPECS=true, which wasn't the case before. According to the documentation, it was only needed if we wanted to interoperate with the CPUManager in Kubernetes, and it requires deploying the daemonset with elevated privileges. Why is setting this variable needed?
@cdrcnm yes, that is now necessary and the documentation should be updated. It's needed now for the same reasons described in my comment above: #1671 (comment).
Note: this is an unfortunate truth for the moment and will go away once CDI becomes the standard for device injection in containerized environments (and we update the device plugin to support CDI as well). CDI support has already been added to cri-o and containerd and we are in the process of making the nvidia device plugin CDI aware. Once all the pieces are in place we will update our documentation to instruct people on how to use it.
@klueska we ran into this after we fixed a similar containerd/runc issue.
We're running Kubernetes on A100s where the DGXOS distribution doesn't bake in 1.12.X of the ctk.
Are there any other options that don't involve manual char device creation to get people over the line?
We'll probably end up upgrading the gpu operator, but there are breaking changes between the version we currently run and the version this suggests, so we're thinking about applying the workaround first and planning the upgrade further.
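Hmmm, I wrote a little script to create the device links:
#!/bin/bash
# Create /dev/char/MAJOR:MINOR symlinks for every NVIDIA character device.
BASE=/dev/char
for d in $(cd "$BASE" && find ../nvidia* -type c); do
  # stat prints the major (%t) and minor (%T) device numbers in hex.
  MAJOR_HEX=$(stat -c %t "$BASE/$d")
  MINOR_HEX=$(stat -c %T "$BASE/$d")
  MAJOR_DEC=$((16#$MAJOR_HEX))
  MINOR_DEC=$((16#$MINOR_HEX))
  # -f replaces an existing link so the script can be re-run.
  ln -sf "$d" "$BASE/${MAJOR_DEC}:${MINOR_DEC}"
done
Then I bounced k3s / containerd and the container.
But I still get this in the container after daemon-reload: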
$ nvidia-smi
Failed to initialize NVML: Unknown Error
Our environment is running k3s with containerd and gpu operator 1.11.1. We use the accept-nvidia-visible-devices-as-volume-mounts feature of the container runtime on each host to allow a pod to share devices between containers in the same pod.
Actually, the symbolic links do work, but only for the container that originally gets the GPU devices.
The GPU just drops out of the sidecar container, which shares the GPU by reading the GPU devices from a config map that the main container writes to on startup. See here for how we use it: https://github.com/harrison-ai/cobalt-docker-rootless-nvidia-dind
Would I need to manually adjust an allow list so it doesn't drop the GPU devices on the sidecar when there's a daemon-reload? We actually don't care about cgroup control for these devices. It's just about soft blocking them so users don't trip over each other.
dind container (main):
cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
c 195:0 rw
c 195:254 rw
c 195:255 rw
c 511:0 rw
c 511:1 rw
workspace container (secondary):
cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
I can manually echo into the cgroup devices.allow and things start working again but that's not ideal.
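To find which rules the sidecar is missing, the two devices.list dumps above can be diffed. This is a sketch assuming both lists were saved to files first; the function name and file names are illustrative.

```shell
# Prints the device rules present in file "$1" (e.g. the main container's
# devices.list) that are missing from file "$2" (e.g. the sidecar's).
# -F treats each rule as a fixed string, -x matches whole lines only.
missing_device_rules() {
    grep -vxF -f "$2" "$1"
}
```

With the two listings above, missing_device_rules main.list sidecar.list would print the five rules for majors 195 and 511. Each printed rule is in the format that the cgroup v1 devices.allow file accepts, so they are the lines one would echo into the sidecar's devices.allow.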
We also hit this when we upgraded from Ubuntu 16 (kernel 4.9) to Ubuntu 20 (kernel 5.4)!
docker version 20.10.7
containerd version 1.4.6
runc version rc-95
native.cgroupdriver=systemd (Docker and Kubernetes have recommended it for a long time; I think most clusters use it)
Nothing else changed, so why does systemctl daemon-reload make the container lose its GPU devices?
I noticed the systemd version mentioned in the latest runc issue, opencontainers/runc#3708.
Going from Ubuntu 16 (kernel 4.9) to Ubuntu 20 (kernel 5.4), systemd was upgraded from 229 to 245!
ubuntu 16 (kernel 4.9) systemd 229 cgroup v1
ubuntu 20 (kernel 5.4) systemd 245 cgroup v1 (default-hierarchy=hybrid)
So there are 3 main factors:
- device plugin --pass-device options
- runc version
- systemd version
I tested the same cases and found that different systemd versions reconcile the systemd scope config with the cgroup config differently on daemon-reload:
1. systemd adds device A, device A cannot be found with stat(2), cgroup adds device A.
   On systemctl daemon-reload:
   a. systemd 229 clears device A from the cgroup
   b. systemd 245 does nothing
2. systemd adds device A, device A can be found with stat(2), cgroup adds device A.
   On systemctl daemon-reload:
   a. systemd 229 does nothing
   b. systemd 245 does nothing
3. systemd does not add device A, the stat(2) result for device A does not matter (found or not), cgroup adds device A.
   On systemctl daemon-reload:
   a. systemd 229 does nothing
   b. systemd 245 clears device A from the cgroup
Given these differences: our k8s cluster ran with --pass-device=false, systemd 229 and runc rc-95, which corresponds to case 3, so systemctl daemon-reload worked. But after we upgraded to systemd 245, systemctl daemon-reload breaks the container's device list.
Of course, the different ways runc versions handle devices with systemd make this issue even more confusing, e.g.:
- before runc rc92, runc did not sync devices with systemd
- should a non-existent device path be added to systemd? opencontainers/runc#3671
(this has been fixed in https://github.com/opencontainers/runc/issues/3708 by checking for systemd version 240; maybe something changed starting with systemd 240)
To make this clearer, I drew a map of it; maybe it helps.
There is an issue out against runc discussed here opencontainers/runc#3708 (comment) that also discusses this. According to the author there were fixes merged into both main and release-1.1. Do your experiments contain these fixes?
I verified them yesterday, although I always passed device nodes in my tests.
The newly released runc 1.1.7 fixes how runc handles whether /dev/char/xx exists or not.
With these new fixes:
`pass-device` + `/dev/char/xx does not exist` + `systemd 229 ( < 240)`: reload succeeds
`pass-device` + `/dev/char/xx does not exist` + `systemd 245 ( >= 240)`: reload succeeds
`pass-device` + `/dev/char/xx exists` + `systemd 229 ( < 240)`: reload succeeds
`pass-device` + `/dev/char/xx exists` + `systemd 245 ( >= 240)`: reload succeeds
So with pass-device = true, there is no need to create the /dev/char/xx links for the NVIDIA GPU driver;
but with pass-device=false and systemd 245 (>= 240), reload fails with every runc (>= rc92)!
I updated the map for the new runc version (1.1.7).
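The reported results can be condensed into a small decision function. This is only a restatement of the observations in this comment for runc 1.1.7, not a rule taken from runc or systemd; the function name is made up, and the branch for pass-device=false on systemd older than 240 is inferred from the earlier systemd 229 observation with runc rc-95.

```shell
# Prints "ok" or "breaks" for the effect of `systemctl daemon-reload` on
# GPU access, following the results reported above.
# $1: "true" if device nodes are passed to the runtime (pass-device).
# $2: systemd version number, e.g. 229 or 245.
reload_outcome() {
    if [ "$1" = "true" ]; then
        echo ok
    elif [ "$2" -ge 240 ]; then
        echo breaks
    else
        echo ok
    fi
}
```

For example, reload_outcome false 245 prints "breaks", the combination described as failing for every runc from rc92 onward.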
@gaopeiliang as per #1671 (comment), when using systemd cgroup management (and newer systemd versions) it is required to pass the device nodes when launching a container. This is a separate issue from the runc bug that was fixed or for which the /dev/char symlinks were a workaround.
Hey @klueska, is pass-device-specs still required even after using the udev rule with nvidia-ctk? Or can I just use nvidia-ctk without setting pass-device-specs in the k8s device plugin?
Yes. It is still needed. The fix ensures that device access is not lost even when you use pass-device-specs.
With systemd cgroup management you must always pass the nvidia device nodes on the docker command line (which you are not doing).
Meaning you would need to run:
docker run -d \ --restart unless-stopped \ --name nvidia-smi-rest \ --gpus 'all,"capabilities=utility"' \ --device /dev/nvidiactl \ --device /dev/nvidia0 \ ... --cpus 1 \ --memory 1g \ --memory-swap 1.5g \ mbentley/nvidia-smi-rest
This is due to the way GPU injection currently happens from within a runc hook when the
--gpus
flag is used. The hook manually sets up the cgroups for the NVIDIA devices behind the back of docker/containerd/runc -- so when a systemd daemone-reload happens the cgroup access for these devices gets undone (because these runtimes had no way of telling systemd that these devices had been injected by the hook and the reload triggers it to reevaluate all cgroup rules).This issue only started to be noticed by most people recently because the latest release of docker flipped to using systemd cgroup management by default (as opposed to cgroupfs).
The good news is, once CDI support is added to docker, this won't be necessary anymore. docker/cli#3864
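As a rough illustration of the requirement described above (passing every NVIDIA device node explicitly alongside --gpus), the following Python sketch builds the extra --device flags by scanning a device directory. The helper name `nvidia_device_flags` is hypothetical, and matching only `nvidiactl` and `nvidiaN` nodes is an assumption based on the nodes named in this thread; a given system may need other nodes as well.

```python
import re
from pathlib import Path

# Matches nvidiactl and per-GPU nodes such as nvidia0, nvidia1.
_NODE_RE = re.compile(r"^nvidia(ctl|\d+)$")


def nvidia_device_flags(dev_dir="/dev"):
    """Return ["--device", path, ...] for NVIDIA nodes found in dev_dir.

    Hypothetical helper: it only inspects file names and does not verify
    that the entries are real character devices.
    """
    names = sorted(p.name for p in Path(dev_dir).iterdir() if _NODE_RE.match(p.name))
    flags = []
    for name in names:
        flags += ["--device", str(Path(dev_dir) / name)]
    return flags


if __name__ == "__main__":
    # Print the flags so they can be pasted into a docker run command line.
    print(" ".join(nvidia_device_flags()))
```

The output is meant to be appended to a docker run command that already uses --gpus, as in the example command above.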
Hi @klueska @elezar, what's the suggested equivalent of the docker --device flags for Kubernetes GPU pods using containerd?
I added pass-device-specs and created the symlinks, but it didn't work for me. I am not sure how we can pass --device in a Pod spec. So does this mean it is an acknowledged issue for Kubernetes GPU workloads using systemd cgroup?
Update: tried runc 1.1.7 with systemd 245, but it didn't solve the issue.
@gaopeiliang as per #1671 (comment), when using systemd cgroup management (and newer systemd versions) it is required to pass the device nodes when launching a container. This is a separate issue from the runc bug that was fixed or for which the /dev/char symlinks were a workaround.
Hmm ... our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we will need to test it.
Another question: what does the no-cgroups = bool option in the config file /etc/nvidia-container-runtime/config.toml mean? Is there any spec or link about it?
Can we use pass-device-specs + no-cgroups = true + systemd to avoid the device management problem? @klueska @elezar
Hmm ... our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we will need to test it.
Which version are you using?
The no-cgroups option is used to control whether the NVIDIA Container Library should update the cgroups for a container to allow access to a device. For the rootless case, where a user does not have permissions to manage cgroups, these cgroup updates must be disabled (no-cgroups = true). I don't have enough experience to know whether your proposed combination would work as expected.
@didovesei with regards to:
I added pass-device-specs and created the symlinks but it didn't work for me. I am not sure how we can pass --device in a Pod spec. So does it mean this is an acknowledged issue for Kubernetes GPU workloads using systemd cgroup?
How did you add the pass-device-specs option? This is an option typically set as an environment variable for the GPU device plugin. Which version of the plugin are you using?
Hi @elezar, I was using device plugin v0.10.0 + containerd 1.6.0 + systemd 245 + runc 1.1.7. I passed pass-device-specs in the device plugin args.
containers:
- args:
  - --fail-on-init-error=false
  - --mig-strategy=mixed
  - --pass-device-specs=true
I think the flag was taking effect (although not working), since now when I run nvidia-smi in the GPU Pod after a daemon-reload, it shows the message below instead of the NVML error.
root@gpu:/# nvidia-smi
No devices were found
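Since the two failure messages seen in this thread point at different states, a small check can help when watching containers over time. The following Python sketch classifies captured nvidia-smi output using only the two message strings quoted in this thread; the function name and the returned labels are hypothetical.

```python
def classify_nvidia_smi(output):
    """Map captured nvidia-smi output to a coarse, hypothetical label.

    Only the two messages quoted in this thread are recognised; anything
    else is reported as "other" rather than assumed to be healthy.
    """
    if "Failed to initialize NVML" in output:
        return "nvml-error"
    if "No devices were found" in output:
        return "no-devices"
    return "other"
```

The output text could be captured with `subprocess.run(["nvidia-smi"], capture_output=True, text=True)` inside the container and passed to this function (stdout and stderr combined).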
I might have been a bit unclear in my last comment, but my real point is that @klueska's comment mentioned:
Note: this does not address the issue where you still need to explicitly pass the device nodes for /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl on the command line (that won't be fixed until CDI support is added to docker).
This fixes the issue where, even if you do explicitly pass the device nodes, you STILL lose access to the GPUs on a systemctl daemon reload.
AFAIU, however, in a K8s context the devices should be passed into the Pod through the device plugin, so we shouldn't expect the user to explicitly pass the /dev nodes into the Pod. Besides, I am not sure there is an equivalent of the docker --device flags in a K8s Pod spec. Given all the above, does this mean it is an acknowledged limitation of the NVIDIA K8s solution for certain combinations of configurations (like containerd + systemd + cgroup v1)?
@didovesei was the plugin running as a privileged container? This is required to pass the device nodes.
@elezar It's not in privileged mode. I have been using a config similar to this one for the DP.
Is privileged mode a requirement specific to this issue, or does NVIDIA suggest using it for the DP in general?
See https://github.com/NVIDIA/k8s-device-plugin#setting-other-helm-chart-values (which needs an update to discuss the options and how to set up privileged mode). Privileged mode is required when passing the device specs so that the device plugin can see all the required device nodes. Otherwise it would not have the required access (even though this is also provided by the nvidia container toolkit).
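The two device-plugin settings discussed above (the PASS_DEVICE_SPECS environment variable and a privileged plugin container) could be applied to a container spec along these lines. This is a minimal Python sketch operating on a plain dict, as one might do in a script before applying a manifest; the helper name is hypothetical and the recommended route remains the plugin's own deployment options.

```python
import copy


def enable_pass_device_specs(container):
    """Return a copy of a container spec dict with both settings applied.

    Hypothetical helper: sets the PASS_DEVICE_SPECS environment variable
    and marks the container as privileged, leaving other fields untouched.
    """
    patched = copy.deepcopy(container)
    # Drop any existing entry so the variable appears exactly once.
    env = [e for e in patched.get("env", []) if e.get("name") != "PASS_DEVICE_SPECS"]
    env.append({"name": "PASS_DEVICE_SPECS", "value": "true"})
    patched["env"] = env
    patched.setdefault("securityContext", {})["privileged"] = True
    return patched
```

The input dict is not modified, so the original and patched specs can be compared before anything is applied to a cluster.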
Using privileged mode for the DP didn't work, but using privileged mode for the user workload Pod did. Also, it seems that as long as the user workload Pod is privileged there aren't any problems: the DP doesn't need to be privileged, and no symlinks for the char devices need to be created.
That is true, but most users don't want to run their user pods as privileged (and they shouldn't have to if everything else is set up properly).
Hmm ... our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we will need to test it.
Which version are you using?
The no-cgroups option is used to control whether the NVIDIA Container Library should update the cgroups for a container to allow access to a device. For the rootless case, where a user does not have permissions to manage cgroups, these cgroup updates must be disabled (no-cgroups = true). I don't have enough experience to know whether your proposed combination would work as expected.
The device-plugin version is 1.0.0-beta.
runc will also write to the cgroup fs if it has a device list, so pass-device + no-cgroup=true always succeeded when I tested it.
Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: #1730
Thanks @breakingflower, that's very useful.
FYI: From the Notice:
Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).
Does sound very promising but unfortunately doesn't solve the issue.
I can confirm that using the new version of GPU Operator resolves the issue when CDI is enabled in the gpu-operator config:
cdi:
  enabled: true
  default: true
However, I am facing an issue where nvidia-container-toolkit-daemonset couldn't start properly after a reboot of the machine:
Warning Failed 4m34s (x4 over 6m10s) kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=all: unknown
Any update on this?
I tried the suggested approach in #6380, but it didn't solve the problem. It is quite frustrating as I cannot rely on AKS at the moment. I hope this issue is solved soon.
@rogelioamancisidor we've heard that AKS ships with a really old version of the k8s-device-plugin (from 2019!) which doesn't support the PASS_DEVICE_SPECS flag. You will need to update the plugin to a newer one and pass this flag for things to work on AKS.
@klueska Here is the plugin that was suggested to me in the other discussion, and I just noticed, as you mentioned, that it dates from 2019. Do you have a link to a newer k8s-device-plugin? I'd really appreciate it, as I have tried different things without any luck.
The plugin is available here: https://github.com/NVIDIA/k8s-device-plugin. The README should cover a variety of deployment options, where helm is recommended.
The latest version of the plugin is v0.14.1.
I deployed a DaemonSet for the NVIDIA device plugin using the yaml manifest in the link that I posted. The manifest in the link includes this line: - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1. Isn't that manifest deploying the latest version then? PASS_DEVICE_SPECS is also set to true, as suggested by AKS.
Here is the official solution.
Modify /etc/docker/docker.json:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
It is working.
modify /etc/docker/docker.json
Isn't it /etc/docker/daemon.json?
@homjay I don't think that solution works on K8s.
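For plain Docker hosts, the cgroupfs workaround posted above amounts to making sure exec-opts contains native.cgroupdriver=cgroupfs. The following Python sketch applies that change to a config dict without touching other keys; the function name is hypothetical, and whether this workaround is appropriate (it is reported above as not applicable to K8s) depends on the setup. The Docker daemon reads its configuration from /etc/docker/daemon.json by default and must be restarted for the change to take effect.

```python
import json


def with_cgroupfs_driver(config):
    """Return a copy of a Docker daemon config dict using the cgroupfs driver.

    Hypothetical helper: any existing native.cgroupdriver entry in
    "exec-opts" is replaced; other exec-opts and keys are preserved.
    """
    updated = dict(config)
    opts = [
        opt for opt in config.get("exec-opts", [])
        if not opt.startswith("native.cgroupdriver=")
    ]
    opts.append("native.cgroupdriver=cgroupfs")
    updated["exec-opts"] = opts
    return updated


if __name__ == "__main__":
    # Illustrative input only; a real script would load the existing file.
    example = {"default-runtime": "nvidia", "exec-opts": ["native.cgroupdriver=systemd"]}
    print(json.dumps(with_cgroupfs_driver(example), indent=2))
```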
This is an issue as described in NVIDIA/nvidia-container-toolkit#48
Since this issue has a number of different failure modes discussed, I'm going to close this issue and ask that those still having a problem open new issues in the respective repositories.
- For docker command line usage against https://github.com/NVIDIA/nvidia-container-toolkit
- For the GPU Device plugin against https://github.com/NVIDIA/k8s-device-plugin
We are looking to migrate all issues in this repo to https://github.com/NVIDIA/nvidia-container-toolkit in the near term.