NVIDIA/nvidia-docker

"Failed to initialize NVML: Unknown Error" after random amount of time

iFede94 opened this issue · 79 comments

1. Issue or feature description

After a random amount of time (hours or days), the GPUs become unavailable inside all running containers and nvidia-smi returns "Failed to initialize NVML: Unknown Error".
Restarting the containers fixes the issue and the GPUs become available again.
Outside the containers the GPUs continue to work correctly.
I searched the open and closed issues but could not find a solution.

2. Steps to reproduce the issue

All the containers are run with docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
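Since the failure shows up only after hours or days, a simple watchdog on the host can pinpoint when it first happens. This is a hypothetical debugging sketch (the function name and log path are arbitrary choices, not part of any NVIDIA tooling):

```shell
#!/usr/bin/env bash
# Debugging aid: poll nvidia-smi inside every running container and log
# the first time a container loses GPU access. LOG path is arbitrary.
LOG=${LOG:-/tmp/nvml-watchdog.log}

watch_containers() {
    while true; do
        for c in $(docker ps --format '{{.Names}}'); do
            if ! docker exec "$c" nvidia-smi > /dev/null 2>&1; then
                echo "$(date -Is) NVML failure in container: $c" >> "$LOG"
            fi
        done
        sleep 60
    done
}

# Run in the background on the host:
#   watch_containers &
```

Correlating the logged timestamp with syslog/journal entries may reveal what triggered the loss.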

3. Information to attach

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --

I0831 10:36:45.129762 2174149 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0831 10:36:45.129878 2174149 nvc.c:350] using root /
I0831 10:36:45.129892 2174149 nvc.c:351] using ldcache /etc/ld.so.cache
I0831 10:36:45.129906 2174149 nvc.c:352] using unprivileged user 1000:1000
I0831 10:36:45.129960 2174149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0831 10:36:45.130411 2174149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0831 10:36:45.132458 2174150 nvc.c:273] failed to set inheritable capabilities
W0831 10:36:45.132555 2174150 nvc.c:274] skipping kernel modules load due to failure
I0831 10:36:45.133242 2174151 rpc.c:71] starting driver rpc service
I0831 10:36:45.141625 2174152 rpc.c:71] starting nvcgo rpc service
I0831 10:36:45.144941 2174149 nvc_info.c:766] requesting driver information with ''
I0831 10:36:45.146226 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.515.48.07
I0831 10:36:45.146379 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.146563 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.48.07
I0831 10:36:45.146792 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.146986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.147178 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.147375 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.48.07
I0831 10:36:45.147400 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.147598 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.147777 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.147986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.148258 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.148506 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.148699 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.148915 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.148942 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07
I0831 10:36:45.149219 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07
I0831 10:36:45.149467 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.149591 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.149814 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.149996 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.150224 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.150437 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.48.07
I0831 10:36:45.150772 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.515.48.07
I0831 10:36:45.150978 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07
I0831 10:36:45.151147 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.515.48.07
I0831 10:36:45.151335 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.515.48.07
I0831 10:36:45.151592 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.515.48.07
I0831 10:36:45.151786 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.515.48.07
I0831 10:36:45.151970 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.515.48.07
I0831 10:36:45.152225 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.515.48.07
I0831 10:36:45.152480 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.515.48.07
I0831 10:36:45.152791 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.515.48.07
I0831 10:36:45.152999 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.515.48.07
I0831 10:36:45.153254 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.515.48.07
I0831 10:36:45.153580 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.515.48.07
I0831 10:36:45.153853 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.515.48.07
I0831 10:36:45.154063 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.515.48.07
I0831 10:36:45.154259 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.515.48.07
I0831 10:36:45.154473 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07
I0831 10:36:45.154696 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.515.48.07
W0831 10:36:45.154723 2174149 nvc_info.c:399] missing library libnvidia-nscq.so
W0831 10:36:45.154726 2174149 nvc_info.c:399] missing library libcudadebugger.so
W0831 10:36:45.154729 2174149 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0831 10:36:45.154731 2174149 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0831 10:36:45.154733 2174149 nvc_info.c:399] missing library libvdpau_nvidia.so
W0831 10:36:45.154735 2174149 nvc_info.c:399] missing library libnvidia-ifr.so
W0831 10:36:45.154737 2174149 nvc_info.c:399] missing library libnvidia-cbl.so
W0831 10:36:45.154739 2174149 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0831 10:36:45.154741 2174149 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0831 10:36:45.154743 2174149 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0831 10:36:45.154746 2174149 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0831 10:36:45.154748 2174149 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0831 10:36:45.154750 2174149 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0831 10:36:45.154752 2174149 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0831 10:36:45.154754 2174149 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0831 10:36:45.154756 2174149 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0831 10:36:45.154758 2174149 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0831 10:36:45.154760 2174149 nvc_info.c:403] missing compat32 library libnvoptix.so
W0831 10:36:45.154762 2174149 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0831 10:36:45.154919 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0831 10:36:45.154945 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0831 10:36:45.154954 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0831 10:36:45.154970 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0831 10:36:45.154980 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0831 10:36:45.155027 2174149 nvc_info.c:425] missing binary nv-fabricmanager
I0831 10:36:45.155044 2174149 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.48.07/gsp.bin
I0831 10:36:45.155058 2174149 nvc_info.c:529] listing device /dev/nvidiactl
I0831 10:36:45.155061 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm
I0831 10:36:45.155063 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0831 10:36:45.155065 2174149 nvc_info.c:529] listing device /dev/nvidia-modeset
I0831 10:36:45.155080 2174149 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0831 10:36:45.155092 2174149 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0831 10:36:45.155100 2174149 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0831 10:36:45.155102 2174149 nvc_info.c:822] requesting device information with ''
I0831 10:36:45.161039 2174149 nvc_info.c:713] listing device /dev/nvidia0 (GPU-13fd0930-06c3-5975-8720-72c72ee7a823 at 00000000:01:00.0)
I0831 10:36:45.166471 2174149 nvc_info.c:713] listing device /dev/nvidia1 (GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 at 00000000:02:00.0)
NVRM version:   515.48.07
CUDA version:   11.7

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 2080 Ti
Brand:          GeForce
GPU UUID:       GPU-13fd0930-06c3-5975-8720-72c72ee7a823
Bus Location:   00000000:01:00.0
Architecture:   7.5

Device Index:   1
Device Minor:   1
Model:          NVIDIA GeForce RTX 2080 Ti
Brand:          GeForce
GPU UUID:       GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
Bus Location:   00000000:02:00.0
Architecture:   7.5
I0831 10:36:45.166493 2174149 nvc.c:434] shutting down library context
I0831 10:36:45.166540 2174152 rpc.c:95] terminating nvcgo rpc service
I0831 10:36:45.166751 2174149 rpc.c:135] nvcgo rpc service terminated successfully
I0831 10:36:45.167790 2174151 rpc.c:95] terminating driver rpc service
I0831 10:36:45.167907 2174149 rpc.c:135] driver rpc service terminated successfully
  • Kernel version from uname -a
Linux wds-co-ml 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Driver information from nvidia-smi -a
==============NVSMI LOG==============

Timestamp                                 : Wed Aug 31 12:42:55 2022
Driver Version                            : 515.48.07
CUDA Version                              : 11.7

Attached GPUs                             : 2
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 2080 Ti
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-13fd0930-06c3-5975-8720-72c72ee7a823
    Minor Number                          : 0
    VBIOS Version                         : 90.02.0B.00.C7
    MultiGPU Board                        : No
    Board ID                              : 0x100
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0710DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x150319DA
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 11264 MiB
        Reserved                          : 244 MiB
        Used                              : 1 MiB
        Free                              : 11018 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 3 MiB
        Free                              : 253 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 30 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 20.87 W
        Power Limit                       : 260.00 W
        Default Power Limit               : 260.00 W
        Enforced Power Limit              : 260.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2160 MHz
        SM                                : 2160 MHz
        Memory                            : 7000 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

GPU 00000000:02:00.0
    Product Name                          : NVIDIA GeForce RTX 2080 Ti
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-a76d37d7-5ed0-58d9-6087-b18fee984570
    Minor Number                          : 1
    VBIOS Version                         : 90.02.17.00.58
    MultiGPU Board                        : No
    Board ID                              : 0x200
    GPU Part Number                       : N/A
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x02
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0710DE
        Bus Id                            : 00000000:02:00.0
        Sub System Id                     : 0x150319DA
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 35 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 11264 MiB
        Reserved                          : 244 MiB
        Used                              : 1 MiB
        Free                              : 11018 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 27 MiB
        Free                              : 229 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 28 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 6.66 W
        Power Limit                       : 260.00 W
        Default Power Limit               : 260.00 W
        Enforced Power Limit              : 260.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2160 MHz
        SM                                : 2160 MHz
        Memory                            : 7000 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None
  • Docker version from docker version
Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:02:46 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:00:51 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.6
  GitCommit:        10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc:
  Version:          1.1.2
  GitCommit:        v1.1.2-0-ga916309
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
ii  libnvidia-cfg1-515:amd64                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-515                       515.48.07-0ubuntu0.22.04.2 all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-515:amd64                515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA libcompute package
ii  libnvidia-compute-515:i386                 515.48.07-0ubuntu0.22.04.2 i386         NVIDIA libcompute package
ii  libnvidia-container-tools                  1.10.0-1                   amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                 1.10.0-1                   amd64        NVIDIA container runtime library
ii  libnvidia-decode-515:amd64                 515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-515:i386                  515.48.07-0ubuntu0.22.04.2 i386         NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64               1:1.1.9-1.1                amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-515:amd64                 515.48.07-0ubuntu0.22.04.2 amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-515:i386                  515.48.07-0ubuntu0.22.04.2 i386         NVENC Video Encoding runtime library
ii  libnvidia-extra-515:amd64                  515.48.07-0ubuntu0.22.04.2 amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-515:amd64                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-515:i386                    515.48.07-0ubuntu0.22.04.2 i386         NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-515:amd64                     515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-515:i386                      515.48.07-0ubuntu0.22.04.2 i386         NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  linux-modules-nvidia-515-5.15.0-43-generic 5.15.0-43.46               amd64        Linux kernel nvidia modules for version 5.15.0-43
ii  linux-modules-nvidia-515-generic-hwe-22.04 5.15.0-43.46               amd64        Extra drivers for nvidia-515 for the generic-hwe-22.04 flavour
ii  linux-objects-nvidia-515-5.15.0-43-generic 5.15.0-43.46               amd64        Linux kernel nvidia modules for version 5.15.0-43 (objects)
ii  linux-signatures-nvidia-5.15.0-43-generic  5.15.0-43.46               amd64        Linux kernel signatures for nvidia modules for version 5.15.0-43-generic
ii  nvidia-compute-utils-515                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA compute utilities
ii  nvidia-container-toolkit                   1.10.0-1                   amd64        NVIDIA container runtime hook
ii  nvidia-docker2                             2.11.0-1                   all          nvidia-docker CLI wrapper
ii  nvidia-driver-515                          515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-515                   515.48.07-0ubuntu0.22.04.2 amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-515                   515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA kernel source package
ii  nvidia-prime                               0.8.17.1                   all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                            510.47.03-0ubuntu1         amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-515                           515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-515              515.48.07-0ubuntu0.22.04.2 amd64        NVIDIA binary Xorg driver
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • Docker command, image and tag used
docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash

The nvidia-smi output shows persistence mode as disabled. Does the behaviour still occur when it is enabled?
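For reference, persistence mode can be enabled on the host either directly via nvidia-smi or through the nvidia-persistenced daemon. A sketch (assumes root access; the systemd unit name applies only if the distro's driver packages ship it):

```shell
#!/usr/bin/env bash
# Enable persistence mode on all GPUs (host side, requires root).
enable_persistence() {
    # Takes effect immediately, but is lost on reboot:
    sudo nvidia-smi -pm 1

    # Alternatively, persist across reboots via the daemon,
    # if the distro ships the nvidia-persistenced unit:
    sudo systemctl enable --now nvidia-persistenced

    # Verify the new setting per GPU:
    nvidia-smi --query-gpu=persistence_mode --format=csv
}
```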

Hey, I have the same problem.

2. Steps to reproduce the issue

docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
root@098b49afe624:/# nvidia-smi 
Fri Sep  2 21:54:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02    Driver Version: 510.68.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

This works until systemctl daemon-reload is run, either manually or automatically by the OS (my assumption, since the failure eventually occurs on its own).

(on host):
systemctl daemon-reload

(inside same running container):

root@098b49afe624:/# nvidia-smi 
Failed to initialize NVML: Unknown Error

Starting a new container works fine until the next systemctl daemon-reload.
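The reproduction steps above can be condensed into one script. This is a sketch under the assumptions that the nvidia/cuda:11.4.2-base-ubuntu18.04 image is available, the host uses systemd, and the user has sudo (the function name is arbitrary):

```shell
#!/usr/bin/env bash
# Condensed reproduction: start a GPU container, confirm nvidia-smi works,
# run systemctl daemon-reload on the host, then check the same container again.
reproduce_nvml_failure() {
    local cid
    cid=$(docker run -d --gpus all nvidia/cuda:11.4.2-base-ubuntu18.04 sleep infinity)

    docker exec "$cid" nvidia-smi && echo "GPU visible before daemon-reload"

    sudo systemctl daemon-reload

    # On affected systems this prints: Failed to initialize NVML: Unknown Error
    docker exec "$cid" nvidia-smi || echo "GPU lost after daemon-reload"

    docker rm -f "$cid" > /dev/null
}

# Invoke on an affected host:
#   reproduce_nvml_failure
```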

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
I0902 21:40:53.603015 2836338 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0902 21:40:53.603083 2836338 nvc.c:350] using root /                                                                  
I0902 21:40:53.603093 2836338 nvc.c:351] using ldcache /etc/ld.so.cache                
I0902 21:40:53.603100 2836338 nvc.c:352] using unprivileged user 1000:1000                
I0902 21:40:53.603133 2836338 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0902 21:40:53.603287 2836338 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0902 21:40:53.607634 2836339 nvc.c:273] failed to set inheritable capabilities        
W0902 21:40:53.607692 2836339 nvc.c:274] skipping kernel modules load due to failure
I0902 21:40:53.608141 2836340 rpc.c:71] starting driver rpc service              
I0902 21:40:53.620107 2836341 rpc.c:71] starting nvcgo rpc service                  
I0902 21:40:53.621514 2836338 nvc_info.c:766] requesting driver information with ''     
I0902 21:40:53.623204 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 21:40:53.623384 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 21:40:53.623470 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02  
I0902 21:40:53.623534 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 21:40:53.623599 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 
I0902 21:40:53.623686 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 21:40:53.623774 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 21:40:53.623838 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 21:40:53.623900 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 21:40:53.623987 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 21:40:53.624046 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02
I0902 21:40:53.624105 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02                                                                                                                               
I0902 21:40:53.624167 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02                                                                                                                                  
I0902 21:40:53.624270 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02                                                                                                                               
I0902 21:40:53.624362 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02                                                                                                                              
I0902 21:40:53.624430 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02                                                                                                                             
I0902 21:40:53.624507 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02                                                                                                                                  
I0902 21:40:53.624590 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02                                                                                                                            
I0902 21:40:53.624684 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02                                                                                                                                     
I0902 21:40:53.624959 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 21:40:53.625088 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02
I0902 21:40:53.625151 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02        
I0902 21:40:53.625213 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02     
I0902 21:40:53.625277 2836338 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02           
W0902 21:40:53.625310 2836338 nvc_info.c:399] missing library libnvidia-nscq.so                                        
W0902 21:40:53.625322 2836338 nvc_info.c:399] missing library libcudadebugger.so                                       
W0902 21:40:53.625330 2836338 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 21:40:53.625340 2836338 nvc_info.c:399] missing library libnvidia-pkcs11.so                                      
W0902 21:40:53.625349 2836338 nvc_info.c:399] missing library libnvidia-ifr.so                                         
W0902 21:40:53.625359 2836338 nvc_info.c:399] missing library libnvidia-cbl.so                                         
W0902 21:40:53.625368 2836338 nvc_info.c:403] missing compat32 library libnvidia-ml.so                                 
W0902 21:40:53.625376 2836338 nvc_info.c:403] missing compat32 library libnvidia-cfg.so                                
W0902 21:40:53.625386 2836338 nvc_info.c:403] missing compat32 library libnvidia-nscq.so                               
W0902 21:40:53.625394 2836338 nvc_info.c:403] missing compat32 library libcuda.so                                      
W0902 21:40:53.625404 2836338 nvc_info.c:403] missing compat32 library libcudadebugger.so                              
W0902 21:40:53.625413 2836338 nvc_info.c:403] missing compat32 library libnvidia-opencl.so                             
W0902 21:40:53.625422 2836338 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so                     
W0902 21:40:53.625432 2836338 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so                    
W0902 21:40:53.625441 2836338 nvc_info.c:403] missing compat32 library libnvidia-allocator.so                          
W0902 21:40:53.625450 2836338 nvc_info.c:403] missing compat32 library libnvidia-compiler.so                           
W0902 21:40:53.625459 2836338 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so                             
W0902 21:40:53.625468 2836338 nvc_info.c:403] missing compat32 library libnvidia-ngx.so                                
W0902 21:40:53.625477 2836338 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 21:40:53.625486 2836338 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 21:40:53.625495 2836338 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 21:40:53.625505 2836338 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 21:40:53.625514 2836338 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 21:40:53.625523 2836338 nvc_info.c:403] missing compat32 library libnvidia-glcore.so                             
W0902 21:40:53.625532 2836338 nvc_info.c:403] missing compat32 library libnvidia-tls.so                  
W0902 21:40:53.625541 2836338 nvc_info.c:403] missing compat32 library libnvidia-glsi.so                               
W0902 21:40:53.625551 2836338 nvc_info.c:403] missing compat32 library libnvidia-fbc.so                                
W0902 21:40:53.625561 2836338 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 21:40:53.625570 2836338 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so                             
W0902 21:40:53.625579 2836338 nvc_info.c:403] missing compat32 library libnvoptix.so                                                                                                                                                          
W0902 21:40:53.625588 2836338 nvc_info.c:403] missing compat32 library libGLX_nvidia.so                                
W0902 21:40:53.625598 2836338 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 21:40:53.625607 2836338 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 21:40:53.625616 2836338 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so                                                                                                                                                 
W0902 21:40:53.625625 2836338 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so                   
W0902 21:40:53.625631 2836338 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 21:40:53.626022 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-smi         
I0902 21:40:53.626055 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 21:40:53.626088 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 21:40:53.626139 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 21:40:53.626172 2836338 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server                             
W0902 21:40:53.626281 2836338 nvc_info.c:425] missing binary nv-fabricmanager                            
I0902 21:40:53.626333 2836338 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 21:40:53.626375 2836338 nvc_info.c:529] listing device /dev/nvidiactl                                    
I0902 21:40:53.626385 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm                                                                                                                                                                  
I0902 21:40:53.626395 2836338 nvc_info.c:529] listing device /dev/nvidia-uvm-tools                                  
I0902 21:40:53.626404 2836338 nvc_info.c:529] listing device /dev/nvidia-modeset                               
W0902 21:40:53.626447 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket          
W0902 21:40:53.626483 2836338 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket        
W0902 21:40:53.626510 2836338 nvc_info.c:349] missing ipc path /tmp/nvidia-mps                                    
I0902 21:40:53.626521 2836338 nvc_info.c:822] requesting device information with ''                          
I0902 21:40:53.633742 2836338 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)                                                                                                      
I0902 21:40:53.640730 2836338 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)                                                                                                      
I0902 21:40:53.647954 2836338 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)                                                                                                      
I0902 21:40:53.655371 2836338 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)                                                                                                      
I0902 21:40:53.663009 2836338 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)                                                                                                      
I0902 21:40:53.670891 2836338 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)                                                                                                      
I0902 21:40:53.679015 2836338 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)                                                                                                      
I0902 21:40:53.687078 2836338 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)                                                                                                      
NVRM version:   510.68.02                                                                                              
CUDA version:   11.6                                                                                                   
                                                                                                                      
Device Index:   0                                                                                                      
Device Minor:   0                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-9c416c82-d801-d28f-0867-dd438d4be914                                                               
Bus Location:   00000000:04:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   1                                                                                                      
Device Minor:   1                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a                                                               
Bus Location:   00000000:05:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   2                                                                                                      
Device Minor:   2                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe                                                               
Bus Location:   00000000:08:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   3                                                                                                      
Device Minor:   3                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-1ab2485c-121c-77db-6719-0b616d1673f4                                                               
Bus Location:   00000000:09:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                                                                                                                                             
Device Index:   4                                                                                                      
Device Minor:   4                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                                                                                                                                         
GPU UUID:       GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c                                                               
Bus Location:   00000000:0b:00.0                                                                                       
Architecture:   6.1                                                                                                    
                                                                                                                      
Device Index:   5                                                                                                      
Device Minor:   5                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-c16444fb-bedb-106d-c188-1f330773cf39                                                               
Bus Location:   00000000:84:00.0                                                                                       
Architecture:   6.1                                                                                                                                                                                                                           
                                                                                                                      
Device Index:   6                                                                                                      
Device Minor:   6                                                                                                      
Model:          NVIDIA TITAN X (Pascal)                                                                                
Brand:          TITAN                                                                                                  
GPU UUID:       GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0                                                               
Bus Location:   00000000:85:00.0                                                                                                                                                                                                              
Architecture:   6.1                                                                                                                                                                                                                           
                                                                                                                                                                                                                                             
Device Index:   7                                                                                                                                                                                                                             
Device Minor:   7                                                                                                                                                                                                                             
Model:          NVIDIA TITAN X (Pascal)                                                                                                                                                                                                       
Brand:          TITAN                                                                                                                                                                                                                         
GPU UUID:       GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28                                                                                                                                                                                      
Bus Location:   00000000:89:00.0                                                                                       
Architecture:   6.1                                                                                                    
I0902 21:40:53.687293 2836338 nvc.c:434] shutting down library context                                                 
I0902 21:40:53.687347 2836341 rpc.c:95] terminating nvcgo rpc service                                                  
I0902 21:40:53.687881 2836338 rpc.c:135] nvcgo rpc service terminated successfully                                     
I0902 21:40:53.692819 2836340 rpc.c:95] terminating driver rpc service                                                 
I0902 21:40:53.693046 2836338 rpc.c:135] driver rpc service terminated successfully                                                                                                                    
  • Kernel version from uname -a
    Linux node5-4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg
    Nothing relevant in dmesg. The only relevant line in journalctl is
    Sep 02 21:17:56 node5-4 systemd[1]: Reloading.
    which appears whenever I run systemctl daemon-reload.

  • Driver information from nvidia-smi -a

Fri Sep  2 21:22:32 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02    Driver Version: 510.68.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN X ...  On   | 00000000:04:00.0 Off |                  N/A |
| 23%   23C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN X ...  On   | 00000000:05:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA TITAN X ...  On   | 00000000:08:00.0 Off |                  N/A |
| 23%   22C    P8     7W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA TITAN X ...  On   | 00000000:09:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA TITAN X ...  On   | 00000000:0B:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA TITAN X ...  On   | 00000000:84:00.0 Off |                  N/A |
| 23%   25C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA TITAN X ...  On   | 00000000:85:00.0 Off |                  N/A |
| 23%   22C    P8     8W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA TITAN X ...  On   | 00000000:89:00.0 Off |                  N/A |
| 23%   23C    P8     7W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  • Docker version from docker version
Client: Docker Engine - Community
 Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:02:46 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:00:51 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.4
  GitCommit:        212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
 runc:
  Version:          1.1.1
  GitCommit:        v1.1.1-0-g52de29d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.10.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.10.0-1     amd64        NVIDIA container runtime library
ii  nvidia-container-runtime      3.10.0-1     all          NVIDIA container runtime
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.10.0-1     amd64        NVIDIA container runtime hook
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
I0902 22:11:39.880399 2840718 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0902 22:11:39.880483 2840718 nvc.c:350] using root /
I0902 22:11:39.880501 2840718 nvc.c:351] using ldcache /etc/ld.so.cache
I0902 22:11:39.880514 2840718 nvc.c:352] using unprivileged user 65534:65534                                                                                                                                                                  
I0902 22:11:39.880559 2840718 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0902 22:11:39.880751 2840718 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0902 22:11:39.884769 2840724 nvc.c:278] loading kernel module nvidia
I0902 22:11:39.884931 2840724 nvc.c:282] running mknod for /dev/nvidiactl                                                                                                                                                                     
I0902 22:11:39.884991 2840724 nvc.c:286] running mknod for /dev/nvidia0                 
I0902 22:11:39.885033 2840724 nvc.c:286] running mknod for /dev/nvidia1                                                                                                                                                                       
I0902 22:11:39.885071 2840724 nvc.c:286] running mknod for /dev/nvidia2                   
I0902 22:11:39.885109 2840724 nvc.c:286] running mknod for /dev/nvidia3                                                                                                                                                                       
I0902 22:11:39.885147 2840724 nvc.c:286] running mknod for /dev/nvidia4                            
I0902 22:11:39.885185 2840724 nvc.c:286] running mknod for /dev/nvidia5                                                                                                                                                                       
I0902 22:11:39.885222 2840724 nvc.c:286] running mknod for /dev/nvidia6                      
I0902 22:11:39.885260 2840724 nvc.c:286] running mknod for /dev/nvidia7                                                                                                                                                                       
I0902 22:11:39.885298 2840724 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0902 22:11:39.892775 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0902 22:11:39.892935 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0902 22:11:39.899624 2840724 nvc.c:296] loading kernel module nvidia_uvm
I0902 22:11:39.899673 2840724 nvc.c:300] running mknod for /dev/nvidia-uvm
I0902 22:11:39.899778 2840724 nvc.c:305] loading kernel module nvidia_modeset
I0902 22:11:39.899820 2840724 nvc.c:309] running mknod for /dev/nvidia-modeset                                                                                                                                                                
I0902 22:11:39.900186 2840725 rpc.c:71] starting driver rpc service
I0902 22:11:39.911718 2840726 rpc.c:71] starting nvcgo rpc service
I0902 22:11:39.912892 2840718 nvc_container.c:240] configuring container with 'compute utility supervised'
I0902 22:11:39.913283 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06
I0902 22:11:39.913368 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.915116 2840718 nvc_container.c:262] setting pid to 2840712
I0902 22:11:39.915147 2840718 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:39.915160 2840718 nvc_container.c:264] setting owner to 0:0
I0902 22:11:39.915171 2840718 nvc_container.c:265] setting bins directory to /usr/bin
I0902 22:11:39.915182 2840718 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0902 22:11:39.915193 2840718 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0902 22:11:39.915204 2840718 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0902 22:11:39.915215 2840718 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0902 22:11:39.915228 2840718 nvc_container.c:270] setting mount namespace to /proc/2840712/ns/mnt
I0902 22:11:39.915240 2840718 nvc_container.c:272] detected cgroupv2
I0902 22:11:39.915271 2840718 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-5fff6f80850791d3858cb511015581375d55ae42df5eb98262ceae31ed47a7d5.scope
I0902 22:11:39.915292 2840718 nvc_info.c:766] requesting driver information with ''
I0902 22:11:39.916901 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02
I0902 22:11:39.917076 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02
I0902 22:11:39.917165 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02
I0902 22:11:39.917236 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02
I0902 22:11:39.917318 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02                                                                                                                       
I0902 22:11:39.917411 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02
I0902 22:11:39.917503 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.917574 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02
I0902 22:11:39.917639 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 22:11:39.917730 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02
I0902 22:11:39.917794 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02                                                                                                                                 
I0902 22:11:39.917859 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02                                                                                                                               
I0902 22:11:39.917926 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02
I0902 22:11:39.918018 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02
I0902 22:11:39.918109 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02
I0902 22:11:39.918176 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02                                                                                                                             
I0902 22:11:39.918243 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02                                                                                                                                  
I0902 22:11:39.918335 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02                                                                                                                            
I0902 22:11:39.918429 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02                                                                                                               
I0902 22:11:39.918628 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02                                                                                                                  
I0902 22:11:39.918758 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02                                                                                                            
I0902 22:11:39.918827 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02                                                                                                          
I0902 22:11:39.918896 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02                                                                                                              
I0902 22:11:39.918968 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02
W0902 22:11:39.919005 2840718 nvc_info.c:399] missing library libnvidia-nscq.so
W0902 22:11:39.919022 2840718 nvc_info.c:399] missing library libcudadebugger.so
W0902 22:11:39.919035 2840718 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0902 22:11:39.919049 2840718 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0902 22:11:39.919061 2840718 nvc_info.c:399] missing library libnvidia-ifr.so                                                                                                                                                                
W0902 22:11:39.919074 2840718 nvc_info.c:399] missing library libnvidia-cbl.so
W0902 22:11:39.919088 2840718 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0902 22:11:39.919107 2840718 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0902 22:11:39.919119 2840718 nvc_info.c:403] missing compat32 library libnvidia-nscq.so                                                                                                                                                      
W0902 22:11:39.919131 2840718 nvc_info.c:403] missing compat32 library libcuda.so       
W0902 22:11:39.919144 2840718 nvc_info.c:403] missing compat32 library libcudadebugger.so                                                                                                                                                     
W0902 22:11:39.919156 2840718 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0902 22:11:39.919168 2840718 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so                                                                                                                                            
W0902 22:11:39.919192 2840718 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0902 22:11:39.919206 2840718 nvc_info.c:403] missing compat32 library libnvidia-allocator.so                                                                                                                                                 
W0902 22:11:39.919218 2840718 nvc_info.c:403] missing compat32 library libnvidia-compiler.so 
W0902 22:11:39.919230 2840718 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so                                                                                                                                                    
W0902 22:11:39.919242 2840718 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0902 22:11:39.919254 2840718 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0902 22:11:39.919266 2840718 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0902 22:11:39.919279 2840718 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0902 22:11:39.919291 2840718 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0902 22:11:39.919304 2840718 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0902 22:11:39.919317 2840718 nvc_info.c:403] missing compat32 library libnvidia-glcore.so                                                                                                                                                    
W0902 22:11:39.919329 2840718 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0902 22:11:39.919341 2840718 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0902 22:11:39.919353 2840718 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0902 22:11:39.919365 2840718 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0902 22:11:39.919377 2840718 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0902 22:11:39.919388 2840718 nvc_info.c:403] missing compat32 library libnvoptix.so
W0902 22:11:39.919401 2840718 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0902 22:11:39.919413 2840718 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0902 22:11:39.919426 2840718 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0902 22:11:39.919438 2840718 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0902 22:11:39.919451 2840718 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0902 22:11:39.919463 2840718 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0902 22:11:39.919856 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0902 22:11:39.919895 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0902 22:11:39.919931 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0902 22:11:39.919985 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0902 22:11:39.920022 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0902 22:11:39.920096 2840718 nvc_info.c:425] missing binary nv-fabricmanager
I0902 22:11:39.920152 2840718 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin
I0902 22:11:39.920200 2840718 nvc_info.c:529] listing device /dev/nvidiactl
I0902 22:11:39.920215 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm                                   
I0902 22:11:39.920228 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm-tools                                                                                                                                                            
I0902 22:11:39.920240 2840718 nvc_info.c:529] listing device /dev/nvidia-modeset                                    
W0902 22:11:39.920281 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket             
W0902 22:11:39.920324 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket         
W0902 22:11:39.920355 2840718 nvc_info.c:349] missing ipc path /tmp/nvidia-mps                             
I0902 22:11:39.920371 2840718 nvc_info.c:822] requesting device information with ''                               
I0902 22:11:39.927586 2840718 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)                                                                                                      
I0902 22:11:39.934626 2840718 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)                                                                                                      
I0902 22:11:39.941796 2840718 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0)
I0902 22:11:39.949011 2840718 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0)
I0902 22:11:39.956304 2840718 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)
I0902 22:11:39.963862 2840718 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)                                                                                                      
I0902 22:11:39.971543 2840718 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)                                                                                                      
I0902 22:11:39.979406 2840718 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)                                                                                                      
I0902 22:11:39.979522 2840718 nvc_mount.c:366] mounting tmpfs at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia                                    
I0902 22:11:39.980084 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-smi                      
I0902 22:11:39.980181 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-debugdump          
I0902 22:11:39.980273 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-persistenced     
I0902 22:11:39.980360 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-control    
I0902 22:11:39.980443 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-server     
I0902 22:11:39.980696 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02                                                                                                                                                                                              
I0902 22:11:39.980795 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02                                                                                                                                                                                            
I0902 22:11:39.980919 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 22:11:39.981004 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02
I0902 22:11:39.981090 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 22:11:39.981182 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 22:11:39.981272 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 22:11:39.981314 2840718 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0902 22:11:39.981482 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.470.129.06
I0902 22:11:39.981569 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.129.06
I0902 22:11:39.981887 2840718 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/510.68.02/gsp.bin at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/lib/firmware/nvidia/510.68.02/gsp.bin with flags 0x7
I0902 22:11:39.981971 2840718 nvc_mount.c:230] mounting /dev/nvidiactl at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidiactl
I0902 22:11:39.982876 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm
I0902 22:11:39.983470 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm-tools
I0902 22:11:39.983976 2840718 nvc_mount.c:230] mounting /dev/nvidia0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia0
I0902 22:11:39.984099 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:04:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:04:00.0
I0902 22:11:39.984695 2840718 nvc_mount.c:230] mounting /dev/nvidia1 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia1
I0902 22:11:39.984812 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:05:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:05:00.0
I0902 22:11:39.985425 2840718 nvc_mount.c:230] mounting /dev/nvidia2 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia2
I0902 22:11:39.985541 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:08:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:08:00.0
I0902 22:11:39.986207 2840718 nvc_mount.c:230] mounting /dev/nvidia3 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia3
I0902 22:11:39.986322 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:09:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:09:00.0
I0902 22:11:39.986963 2840718 nvc_mount.c:230] mounting /dev/nvidia4 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia4
I0902 22:11:39.987076 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:0b:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:0b:00.0
I0902 22:11:39.987794 2840718 nvc_mount.c:230] mounting /dev/nvidia5 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia5
I0902 22:11:39.987907 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:84:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:84:00.0
I0902 22:11:39.988593 2840718 nvc_mount.c:230] mounting /dev/nvidia6 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia6
I0902 22:11:39.988707 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:85:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:85:00.0
I0902 22:11:39.989388 2840718 nvc_mount.c:230] mounting /dev/nvidia7 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia7
I0902 22:11:39.989515 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:89:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:89:00.0
I0902 22:11:39.990197 2840718 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged
I0902 22:11:40.012422 2840718 nvc.c:434] shutting down library context
I0902 22:11:40.012510 2840726 rpc.c:95] terminating nvcgo rpc service
I0902 22:11:40.013110 2840718 rpc.c:135] nvcgo rpc service terminated successfully
I0902 22:11:40.018693 2840725 rpc.c:95] terminating driver rpc service
I0902 22:11:40.018995 2840718 rpc.c:135] driver rpc service terminated successfully
  • Docker command, image and tag used
docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
nvidia-smi 

Other open issues

NVIDIA/nvidia-container-toolkit#251 but this is using cgroup v1
#1661 there isn't any information posted and it's on Ubuntu 20.04 instead of 22.04

Important notes / workaround

With containerd.io v1.6.7 or v1.6.8, even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and the devices passed explicitly to docker run, you still get Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run, e.g. docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

@elezar Previously persistence mode was off, so this happens either way.

Also, on k8s-device-plugin/issues/289 @klueska said:
The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath runC.
Was that merged, or is it something I should try?

@kevin-bockman the experimental mode is still a work in progress and we don't have a concrete timeline on when this will be available for testing. I will update the issue here as soon as I have more information.

The other option is to move to cgroupv2. Since devices are not an actual subsystem in cgroupv2, there is no chance for containerd to undo what libnvidia-container has done under the hood after a restart.

@klueska Sorry, with all of the information, it wasn't really clear. The problem is that it's already on cgroupv2 AFAIK. I started from a fresh install of Ubuntu 22.04.1. docker info says it is at least.

The only way I could get this to work after a systemctl daemon-reload is downgrading containerd.io to 1.6.6 and specifying no-cgroups. The other interesting thing is that with containerd v1.6.7 or v1.6.8, even specifying no-cgroups still had the issue, so I'm wondering if there's more than one issue here. I know cgroup v2 has 'fixed' the issue for some people, or so they think (this can look intermittent if you don't know that the reload triggers it), but it hasn't fixed it for everyone: on a fresh install it still breaks after a daemon-reload, or after waiting for something to be triggered by the OS.

$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)

Server:
 Containers: 4
  Running: 4
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-46-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 94.36GiB
 Name: node5-4
 ID: PPB6:APYD:PKMA:BIOZ:2Y3H:LZUV:TPHD:SBZE:XRSL:NJCB:PWMX:ZVBY
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

@kevin-bockman I had a similar experience.

In my case,

docker run -it --device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm  \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
--device /dev/nvidia2:/dev/nvidia2 \
--device /dev/nvidia3:/dev/nvidia3 \
--name <container_name> <image_name>
(Replace/repeat nvidia0 with other/more devices as needed.)

This setting works on some machines and not on others.
Eventually I found that the working machines have containerd.io version 1.4.6-1 (Ubuntu 18.04)!!!
On an Ubuntu 20.04 machine, containerd.io version 1.5.2-1 makes it work.

I tried downgrading and upgrading containerd.io to check whether this strategy works. It works for me.

The above turned out not to be the answer...

It prevents the NVML error caused by a docker resource update, but the NVML error still occurs after a random amount of time.

Same issue. Ubuntu 22.04, Docker CE. I will just end up writing a cron job script to check for the error and restart the container.
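For anyone wanting the same stopgap, such a watchdog can be sketched roughly like this. This is only an illustration, not an official fix; the function name and the container name you pass in are my own, and it just restarts the container when nvidia-smi fails inside it:

```shell
#!/usr/bin/env bash
# Illustrative watchdog for the NVML failure in this thread: if
# nvidia-smi stops working inside a container, restart that container.

check_and_restart() {
  local container="$1"
  # In the broken state, nvidia-smi inside the container prints
  # "Failed to initialize NVML: Unknown Error" and exits nonzero,
  # while nvidia-smi on the host keeps working.
  if ! docker exec "$container" nvidia-smi >/dev/null 2>&1; then
    echo "NVML broken in $container, restarting"
    docker restart "$container"
  fi
}
```

Call `check_and_restart <your-container>` at the end of the script and run it from cron every few minutes (paths and schedule are up to you).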

The solution proposed by @kevin-bockman has been working without any problem for more than a month now.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run, e.g. docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

I am using docker-ce on Ubuntu 22, so I opted for this approach, working fine so far.

myron commented

same issue on Nvidia 3090
Ubuntu 22.04.1 LTS, Driver Version: 510.85.02 CUDA Version: 11.6

Hello there.

I'm hitting the same issue here, but with containerd rather than docker.

Here's my configuration:

  • GPUs:

     # lspci | grep -i nvidia
     00:04.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
  • OS:

     # cat /etc/lsb-release
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=22.04
     DISTRIB_CODENAME=jammy
     DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
  • containerd release:

     # containerd --version
     containerd containerd.io 1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
  • nvidia-container-toolkit version:

     # nvidia-container-toolkit -version
     NVIDIA Container Runtime Hook version 1.11.0
     commit: d9de4a0
  • runc version:

    # runc --version
    runc version 1.1.4
    commit: v1.1.4-0-g5fd4c4d
    spec: 1.0.2-dev
    go: go1.17.13
    libseccomp: 2.5.1

Note that the NVIDIA Container Toolkit has been installed by the NVIDIA GPU Operator on Kubernetes (v1.25.3).

I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment.
containerd.txt
nvidia-container-runtime.txt

How I reproduce this bug:

Running on my host the following command:

# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash

After some time, the nvidia-smi command exits with the error Failed to initialize NVML: Unknown Error.

Traces, logs, etc...

  • Here are the devices listed in the state.json file:
      {
         "type": 99,
         "major": 195,
         "minor": 255,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidiactl",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 234,
         "minor": 0,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-uvm",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 234,
         "minor": 1,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-uvm-tools",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 195,
         "minor": 254,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-modeset",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 195,
         "minor": 0,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia0",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       }

Thank you very much for your help. πŸ™

Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

@gengwg Can you check whether your solution survives a sudo systemctl daemon-reload on the host? In my case (cgroupv1), it directly breaks the pod; from inside the pod, nvidia-smi returns Failed to initialize NVML: Unknown Error.

yes. that's actually the first thing I tested when I upgraded v1 --> v2. it's easy to test, because you don't need to wait a few hours/days.

to double check, i just tested it again right now.

Before:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

Do the reload on that node itself:

# systemctl daemon-reload

After:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

I will update the note to reflect this test too.

And I can also confirm that's what I saw on our cgroupv1 nodes too, i.e. sudo systemctl daemon-reload immediately breaks nvidia-smi.

Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

Hi, what's your cgroup driver for kubelet and containerd? We hit the same problem on cgroup v2; our cgroup driver is systemd, but if we switch the cgroup driver to cgroupfs, the problem disappears. I think it's the systemd cgroup driver that causes the problem.

Also, switching docker's cgroup driver to cgroupfs solves the problem as well.

Important notes / workaround

containerd.io v1.6.7 or v1.6.8 even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specifying the devices to docker run gives Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and pass the devices to docker run, e.g. docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

I've also tried this. The reason containerd 1.6.7 can't work is that runc was updated to 1.1.3; since this PR, runc ignores char devices that can't be os.Stat'ed. Unfortunately, the GPU-related devices are exactly that kind of device, so it goes wrong.

@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.

I deployed two environments to help me making some comparisons:

  • One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
  • One environment with only containerd & nvidia-container-toolkit

Interestingly, I never face this issue on the second environment, everything is running perfectly well.

The first environment though is running into this issue after some time.

That probably means NVIDIA's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure I'm not missing anything.

I'll have a look at the cgroup driver as @panli889 mentioned.

Thanks again for your help

cgroup driver for kubelet, docker and containerd are all systemd. In fact, on cgroupv1 we used to use cgroupfs, but kubelet wouldn't start, complaining about a mismatch between the kubelet and docker cgroup drivers. After I changed the docker (and containerd) cgroup driver to systemd, kubelet was able to start.

# cat /etc/systemd/system/kubelet.service | grep -i cgroup
  --runtime-cgroups=/systemd/system.slice \
  --kubelet-cgroups=/systemd/system.slice \
  --cgroup-driver=systemd \

We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH.

Docker nodes:

# docker info | grep -i cgroup
WARNING: No swap limit support
 Cgroup Driver: systemd
 Cgroup Version: 2
  cgroupns

Containerd nodes:

$ sudo crictl info | grep -i cgroup
            "SystemdCgroup": true
            "SystemdCgroup": true
    "systemdCgroup": false,
    "disableCgroup": false,

Here is our k8s version:

$ k version --short
Client Version: v1.21.3
Server Version: v1.22.9

@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.

I deployed two environments to help me making some comparisons:

  • One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
  • One environment with only containerd & nvidia-container-toolkit

Interestingly, I never face this issue on the second environment, everything is running perfectly well.

The first environment though is running into this issue after some time.

That probably means NVIDIA's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure I'm not missing anything.

I'll have a look at the cgroup driver as @panli889 mentioned.

Thanks again for your help

I think ours is similar to your 2nd env, i.e. containerd & nvidia-container-toolkit. we are on k8s v1.22.9.

# containerd --version
containerd containerd.io 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1

# dnf info nvidia-container-toolkit | grep Version
Version      : 1.11.0

i posted cgroup driver info above.

@gengwg thx for your reply!

cgroup driver for kubelet, docker and containerd are all systemd.

Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?

I can share the problem we hit: when we create a pod with a GPU, a related systemd scope like cri-containerd-xxxxxx.scope is created at the same time, and it records the cgroup info. If we run systemctl status to check its status:

Warning: The unit file, source configuration file or drop-ins of cri-containerd-xxxxx.scope changed on disk. Run 'systemctl daemon-reload' to  reload units.
● cri-containerd-xxx.scope - libcontainer container xxxx
     Loaded: loaded (/run/systemd/transient/cri-containerd-xxxx.scope; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/cri-containerd-xxxxx.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUWeight.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf, 50-AllowedCPUs.conf
     Active: active (running) since Fri 2022-11-25 12:13:33 +08; 1min 47s ago
         IO: 404.0K read, 0B written
      Tasks: 1
     Memory: 528.0K
        CPU: 2.562s
     CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb6b36d39_ef5b_4eb9_850d_d710bbd06096.slice/cri-containerd-xxx.scope>
             └─61265 sleep infinity

And if we check the content of the 50-DeviceAllow.conf file, we find no GPU device info there. Then, if we run systemctl daemon-reload, systemd regenerates the eBPF cgroup device program, and that blocks access to the GPU devices.

So would you please also take a look at the content of DeviceAllow.conf for some systemd scope of pod, what's in there?

Same issue with 2 x Nvidia 3090 Ti, Ubuntu 22.04.1 LTS, Driver Version: 510.85.02, CUDA Version: 11.6
I adopted the solution proposed by @kevin-bockman downgrading containerd.io from 1.6.10 to 1.6.6. After running systemctl daemon-reload on the host machine the nvidia-smi within the container still works properly. I will check how long it lasts and I'll keep you updated.

@panli889 I checked the scope unit with systemctl status, and this message popped up:

Warning: The unit file, source configuration file or drop-ins of cri-containerd-d35333ac42f1e08a33632fccd63028a28443f95f3c126860a8c9da20b6d27102.scope changed on disk. Run 'systemctl daemon-reload' to reload units.

After running systemctl daemon-reload, I get the error on my container:

root@ubuntu:/# nvidia-smi
Failed to initialize NVML: Unknown Error

Here's the content of the 50-DeviceAllow.conf file:

[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

There's indeed no reference to nvidia's devices here:

crw-rw-rw- 1 root root 195, 254 Nov 29 10:18 nvidia-modeset
crw-rw-rw- 1 root root 234,   0 Nov 29 10:18 nvidia-uvm
crw-rw-rw- 1 root root 234,   1 Nov 29 10:18 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Nov 29 10:18 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 29 10:18 nvidiactl

nvidia-caps:
total 0
cr-------- 1 root root 237, 1 Nov 29 10:18 nvidia-cap1
cr--r--r-- 1 root root 237, 2 Nov 29 10:18 nvidia-cap2
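A note for anyone comparing these listings: the /dev/char entries in DeviceAllow use decimal MAJOR:MINOR numbers, while stat(1) reports them in hex. A tiny shell helper (illustrative only; the function name is mine) makes the mapping easy to check:

```shell
# Map stat(1)'s hex major/minor to the decimal MAJOR:MINOR form used by
# the /dev/char paths in 50-DeviceAllow.conf.
hex_to_dec() { echo "$((16#$1))"; }

# nvidiactl in the listing above is 195,255; stat -c '%t:%T' reports c3:ff.
echo "$(hex_to_dec c3):$(hex_to_dec ff)"   # -> 195:255
```

On a real node you would run something like `stat -c '%t:%T' /dev/nvidiactl` and convert both fields; an entry like DeviceAllow=/dev/char/195:255 then corresponds to /dev/nvidiactl.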

@fradsj thanks for your reply; it seems to be the same problem as ours.

Here is how we solve it, hope it will help:

  • Add --pass-device-specs=true to your k8s-device-plugin as this comment said #966 (comment). This param ensures the GPU device specs are returned by the device plugin instead of just setting the env var at allocation time, so the 50-DeviceAllow.conf will include the GPU device info.
  • Ensure the runc version is below 1.1.3. As I mentioned above, runc 1.1.3 introduced a change that makes it ignore the GPU devices passed to runc in step one. opencontainers/runc#3671

Hi,

Any official way to fix this error ?

The official way is in the works.

It is based on using a new specification called CDI to do the GPU device injection, rather than relying on a runc hook to do the GPU device injection behind the back of containerd (which is a fundamental / architectural flaw of the existing nvidia-container-runtime, and is the underlying cause of all these problems).

Until a version of both (1) the nvidia-container-runtime and (2) the k8s-device-plugin are released with proper support for CDI, you will need to rely on one of the workarounds described here.

There is no "official" workaround as such, but the workaround described in #1671 (comment) seems like the best one from my perspective. It relies on the already documented use of --pass-device-specs=true in the k8s-device-plugin (which has been the workaround for years until now) combined with downgrading to a version of runc which doesn't trigger the GPUs to be ignored.

Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?

I can share the problems we meet, if we create a pod with GPU, there will be a related systemd scope created at the same time like cri-containerd-xxxxxx.scope, and it records the cgroup info, if we run systemctl status to check the status:

Warning: The unit file, source configuration file or drop-ins of cri-containerd-xxxxx.scope changed on disk. Run 'systemctl daemon-reload' to  reload units.
● cri-containerd-xxx.scope - libcontainer container xxxx
     Loaded: loaded (/run/systemd/transient/cri-containerd-xxxx.scope; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/cri-containerd-xxxxx.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUWeight.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf, 50-AllowedCPUs.conf
     Active: active (running) since Fri 2022-11-25 12:13:33 +08; 1min 47s ago
         IO: 404.0K read, 0B written
      Tasks: 1
     Memory: 528.0K
        CPU: 2.562s
     CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb6b36d39_ef5b_4eb9_850d_d710bbd06096.slice/cri-containerd-xxx.scope>
             └─61265 sleep infinity

And if we check the content of the 50-DeviceAllow.conf file, we find no GPU device info there. Then, if we run systemctl daemon-reload, systemd regenerates the eBPF cgroup device program, and that blocks access to the GPU devices.

So would you please also take a look at the content of DeviceAllow.conf for some systemd scope of pod, what's in ther

@panli889 sorry for late reply. was on vacation.

systemd version:

$ systemctl --version
systemd 239 (239-58.el8)

After spinning up a pod on a node:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-3836675c-e987-1f01-7ce7-12da20038909)

I don't see the systemd scope nor the DeviceAllow files.

$ find /etc/systemd/ | grep scope
$ sudo find /etc/ | grep -i DeviceAllow

Checked those on our env.

Here is how we solve it, hope it will help:

We didn't use the --pass-device-specs=true option, but we do have allowPrivilegeEscalation: false, which doesn't look like the same thing.

$ k get ds nvidia-device-plugin-daemonset -n kube-system -o yaml
....
    spec:
      containers:
      - args:
        - --fail-on-init-error=false
        image: xxxxx.com/k8s-device-plugin:v0.9.0
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false # <------
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
      dnsPolicy: ClusterFirst
....

Luckily we are right below 1.1.3. We pinned the version on the repo side through centos composes, so this should be safe if we do not advance the compose version.

$ runc --version
runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.2

@panli889 running the device plugin with runc in v1.1.2 seems to fix the situation, as the GPUs are listed in the DeviceAllow file of the cgroup of the container:

[Scope]
DeviceAllow=
DeviceAllow=/dev/char/195:255 rw
DeviceAllow=/dev/char/195:0 rw
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

Thank you very much for your help !

@klueska it's surprising to see that NVIDIA's GPU device nodes are not listed in the /dev/char directory, since that's where runc expects to find them. Do you know if that's expected by NVIDIA's driver developers?

For the CDI, do you know if the kubernetes community is working with you on this, and if there's any release cycle that has been decided yet ?

Thank you very much.

I was able to reproduce this and verify that manually creating symlinks to the various nvidia devices in /dev/char resolves the issue. I need to talk to our driver team to determine why these are not automatically created and how to get them created going forward.

At least we seem to fully understand the problem now, and know what is necessary to resolve it. In the meantime, I would recommend creating these symlinks manually to work around this issue.

KyonP commented

having almost the same issue with a Quadro RTX 8000 cluster server.

I hope there is a quick solution before the official fix.

I have to keep restarting my docker container whenever I hit this issue

GPU Operator seems to have had a release that contained a workaround. NVIDIA/gpu-operator#430 (comment)

Since I am not using GPU Operator, I have a small tool that does the same thing. I can confirm that this solves the problem in my environment. https://gist.github.com/superbrothers/5bbb80e15a7f3ad994f789165dce2938

A tool will be shipping with the next release of the NVIDIA Container Toolkit later today. I'll update here with instructions (or point at the official documentation if it's ready by then).

A tool will be shipping with the next release of the NVIDIA Container Toolkit later today. I'll update here with instructions (or point at the official documentation if it's ready by then).

Hi @klueska, can you point me to the tool/instructions for resolving this issue? Thanks!

  • Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).

  • For deployments using the standalone k8s-device-plugin (i.e. not through the use of the operator), or for standalone docker, follow one of the workarounds listed below (in order of recommendation).

  1. Using the nvidia-ctk utility:
    The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in /dev/char for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:
sudo nvidia-ctk system create-dev-char-symlinks \
--create-all

This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.

A simple udev rule to enforce this can be seen below:

# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"

A good place to install this rule would be:
/lib/udev/rules.d/71-nvidia-dev-char.rules

In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:

sudo nvidia-ctk system create-dev-char-symlinks \
--create-all \
--driver-root={{NVIDIA_DRIVER_ROOT}}

Where {{NVIDIA_DRIVER_ROOT}} is the path to which the NVIDIA GPU Driver container installs the NVIDIA GPU driver and creates the NVIDIA Device Nodes.

  2. Explicitly disabling systemd cgroup management in Docker:
    Set the parameter
    "exec-opts": ["native.cgroupdriver=cgroupfs"] in the /etc/docker/daemon.json file and restart docker.

  3. Downgrading to docker.io packages where systemd is not the default cgroup manager (and not overriding that, of course).
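As an alternative to the udev rule in workaround 1, a oneshot systemd unit can run the same nvidia-ctk command at boot. This is only a sketch, not official documentation; the unit name and the After= ordering are my assumptions:

```ini
# /etc/systemd/system/nvidia-dev-char.service (hypothetical name)
[Unit]
Description=Create /dev/char symlinks for NVIDIA device nodes
# Assumption: the NVIDIA kernel modules are loaded before this runs.
After=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl enable nvidia-dev-char.service so it runs on every boot.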

I'm going down the route of option 1 (nvidia-ctk), as I am running standalone Docker on Debian 11 (bullseye). I've added the udev rule; I haven't rebooted to check that it runs, but I have manually executed nvidia-ctk system create-dev-char-symlinks --create-all and it created the symlinks. I'm using the driver packages directly from the Debian repos, not the GPU Driver container. However, if I run systemctl daemon-reload, it still triggers the same behavior as before and I see Failed to initialize NVML: Unknown Error. I've re-created my GPU containers. Is there something I am missing? Does Docker need to be restarted, or is there something about the specific order of operations beyond the kernel module needing to be loaded before creating the symlinks?

Can you show me your docker command?

Note: this does not address the issue where you still need to explicitly pass the device nodes for /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl on the command line (that won’t be fixed until CDI support is added to docker).

This fixes the issue where β€” even if you do explicitly pass the device nodes β€” you STILL lose access to the GPUs on a systemctl daemon reload.

Sure thing:

docker run -d \
  --restart unless-stopped \
  --name nvidia-smi-rest \
  --gpus 'all,"capabilities=utility"' \
  --cpus 1 \
  --memory 1g \
  --memory-swap 1.5g \
  mbentley/nvidia-smi-rest

/etc/docker/daemon.json:

{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "storage-driver": "overlay2"
}

With systemd cgroup management you must always pass the nvidia device nodes on the docker command line (which you are not doing).

Meaning you would need to run:

docker run -d \
  --restart unless-stopped \
  --name nvidia-smi-rest \
  --gpus 'all,"capabilities=utility"' \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  ...
  --cpus 1 \
  --memory 1g \
  --memory-swap 1.5g \
  mbentley/nvidia-smi-rest

This is due to the way GPU injection currently happens from within a runc hook when the --gpus flag is used. The hook manually sets up the cgroups for the NVIDIA devices behind the back of docker/containerd/runc -- so when a systemd daemon-reload happens, the cgroup access for these devices gets undone (because these runtimes had no way of telling systemd that the devices had been injected by the hook, and the reload triggers it to reevaluate all cgroup rules).

This issue only started to be noticed by most people recently because the latest release of docker flipped to using systemd cgroup management by default (as opposed to cgroupfs).

The good news is, once CDI support is added to docker, this won't be necessary anymore.
docker/cli#3864

Ah, thanks @klueska - makes sense and works as expected. Thanks again!

The fix with the /dev/char symlink creation works fine, thanks.
But now we also need to set PASS_DEVICE_SPECS=true, which wasn't the case before. From the documentation it was only needed to interoperate with the CPUManager in Kubernetes, and it requires deploying the daemonset with elevated privileges. Why is setting this var needed?

@cdrcnm yes, that is now necessary and the documentation should be updated. It's needed now for the same reasons described in my comment above: #1671 (comment).

Note: this is an unfortunate truth for the moment and will go away once CDI becomes the standard for device injection in containerized environments (and we update the device plugin to support CDI as well). CDI support has already been added to cri-o and containerd and we are in the process of making the nvidia device plugin CDI aware. Once all the pieces are in place we will update our documentation to instruct people on how to use it.

@klueska ran into this after we fixed a similar containerd/runc issue.

We're running Kubernetes on A100s where the DGXOS distribution doesn't bake in 1.12.X of the ctk.

Is there any other options that doesn't involve manual char device creation to get people over the line?

We'll probably end up upgrading the gpu operator but it's going to be breaking between the version we currently run and the version this suggests so thinking about doing workaround first and planning that out further.

Hmmm I did a little script to create the device links:

BASE=/dev/char
# For every NVIDIA char device under /dev, read its hex major/minor from
# stat, convert to decimal, and create the /dev/char/MAJOR:MINOR symlink
# that runc looks for.
for d in $( cd $BASE && find ../nvidia* -type c ); do
  MAJOR_HEX=$(stat -c %t $BASE/$d)
  MINOR_HEX=$(stat -c %T $BASE/$d)
  MAJOR_DEC=$((16#$MAJOR_HEX))
  MINOR_DEC=$((16#$MINOR_HEX))
  ln -s $d /dev/char/${MAJOR_DEC}:${MINOR_DEC}
done

Then bounced k3s / containerd and the container.

But still get this in the container after daemon-reload:

$ nvidia-smi
Failed to initialize NVML: Unknown Error

Our environment is running k3s with containerd and gpu operator 1.11.1. We use the accept-nvidia-visible-devices-as-volume-mounts feature of the container runtime on each host to allow a pod to share devices between containers in the same pod.

Actually symbolic links do work but only for the container that originally gets the GPU devices.

It just drops out in the sidecar container, which shares the GPU by reading the GPU devices from a config map that the main container writes to on startup. See here for how we use it: https://github.com/harrison-ai/cobalt-docker-rootless-nvidia-dind

Would I need to manually adjust an allow list so it doesn't drop the GPU devices in the sidecar when there's a daemon-reload? We don't actually care about cgroup control for these devices; it's just about soft-blocking them so users don't trip over each other.

dind container (main):

cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
c 195:0 rw
c 195:254 rw
c 195:255 rw
c 511:0 rw
c 511:1 rw

workspace container (secondary):

cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm

I can manually echo into the cgroup devices.allow and things start working again, but that's not ideal.
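(For reference, a sketch of that manual re-allow on cgroup v1. The container cgroup path below is a hypothetical placeholder, and actually writing the rule requires root:)

```shell
# Sketch (cgroup v1): re-grant access to /dev/nvidia0 after a daemon-reload
# revokes it. Major 195 is the NVIDIA char major; CONTAINER_ID and the cgroup
# path are hypothetical placeholders for illustration.
rule="c 195:0 rw"
cg="/sys/fs/cgroup/devices/docker/CONTAINER_ID"
echo "$rule"                           # the rule we would write
# echo "$rule" > "$cg/devices.allow"   # requires root; uncomment to apply
```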

We also hit this case when we upgraded Ubuntu 16 (kernel 4.9) to Ubuntu 20 (kernel 5.4)!

docker version 20.10.7
containerd version 1.4.6
runc version rc-95
native.cgroupdriver=systemd (recommended by Docker and k8s for a long time; I think most clusters use it)


Nothing else changed. Why does the container lose its GPU devices on systemctl daemon-reload?

I noticed the systemd version comes up in the latest runc issue opencontainers/runc#3708.

Going from Ubuntu 16 (kernel 4.9) to Ubuntu 20 (kernel 5.4), the systemd version was upgraded from systemd 229 to systemd 245!

Ubuntu 16 (kernel 4.9)    systemd 229    cgroup v1

Ubuntu 20 (kernel 5.4)    systemd 245    cgroup v1 (default-hierarchy=hybrid)
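(A quick way to check which combination a given host is running:)

```shell
# Triage: report the systemd version and which cgroup hierarchy is mounted.
systemctl --version 2>/dev/null | head -n1   # e.g. "systemd 245 (245.4-4ubuntu3)"
stat -fc %T /sys/fs/cgroup/                  # "tmpfs" => cgroup v1/hybrid, "cgroup2fs" => v2
```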

So there are 3 main factors:

  1. device plugin --pass-device option
  2. runc version
  3. systemd version

I tested the same case and found that different systemd versions give different results when reconciling the scope's device config with the cgroup config on daemon-reload:

  1. systemd has device A, device A cannot be found with stat(2), cgroup allows device A
    on systemctl daemon-reload:
    a. systemd 229 clears cgroup device A
    b. systemd 245 does nothing

  2. systemd has device A, device A can be found with stat(2), cgroup allows device A
    on systemctl daemon-reload:
    a. systemd 229 does nothing
    b. systemd 245 does nothing

  3. systemd does not have device A, the stat(2) result does not matter (found or not), cgroup allows device A
    on systemctl daemon-reload:
    a. systemd 229 does nothing
    b. systemd 245 clears cgroup device A

Given these differing systemd behaviors:

Our k8s cluster with --pass-device=false, systemd 229, and runc rc-95 hits case 3, so systemctl daemon-reload works fine. But after upgrading to systemd 245, systemctl daemon-reload breaks the container device list.


Of course, the way different runc versions sync devices with systemd makes this issue even more mysterious, e.g.:

  1. before runc rc92, runc did not sync devices with systemd at all
  2. should a non-existent device path be added to systemd? opencontainers/runc#3671
    (fixed via https://github.com/opencontainers/runc/issues/3708 with a check for systemd version 240; maybe something changed starting with systemd 240)

To make it clearer, I drew a map of it; maybe it helps:

[image]

There is an issue out against runc discussed here opencontainers/runc#3708 (comment) that also discusses this. According to the author there were fixes merged into both main and release-1.1. Do your experiments contain these fixes?

I verified them yesterday, although I always passed device nodes in my tests.

There is an issue out against runc discussed here opencontainers/runc#3708 (comment) that also discusses this. According to the author there were fixes merged into both main and release-1.1. Do your experiments contain these fixes?

I verified them yesterday, although I always passed device nodes in my tests.

The newly released runc 1.1.7 fixes how the /dev/char/xx existing-or-not cases are handled.

With this new fix:

 `pass-device`  +  `/dev/char/xx not existed`  +  `systemd 229 (< 240)`     reload success

 `pass-device`  +  `/dev/char/xx not existed`  +  `systemd 245 (>= 240)`    reload success

 `pass-device`  +  `/dev/char/xx existed`      +  `systemd 229 (< 240)`     reload success

 `pass-device`  +  `/dev/char/xx existed`      +  `systemd 245 (>= 240)`    reload success

So with the pass-device=true option, there is no need for the NVIDIA GPU driver to create the /dev/char/xx links;

but with the pass-device=false option and systemd 245 (>= 240), reload fails for all runc versions (>= rc92)!


Updated map for the new runc version (1.1.7):

[image]

@gaopeiliang as per #1671 (comment), when using systemd cgroup management (and newer systemd versions) it is required to pass the device nodes when launching a container. This is a separate issue from the runc bug that was fixed or for which the /dev/char symlinks were a workaround.

Hey @klueska, is pass-device-specs still required even after setting up the udev rule with nvidia-ctk? Or can I just use nvidia-ctk without setting pass-device-specs in the k8s device plugin?

Yes. It is still needed. The fix ensures that device access is not lost even when you use pass-device-specs.
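(For reference, the udev-rule workaround discussed here recreates the /dev/char symlinks on NVIDIA driver events. A sketch of such a rule follows; the file path/name and the exact match expression are assumptions on my part, not an official snippet:)

```
# /etc/udev/rules.d/71-nvidia-dev-char.rules (path and name assumed)
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
```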

With systemd cgroup management you must always pass the nvidia device nodes on the docker command line (which you are not doing).

Meaning you would need to run:

docker run -d \
  --restart unless-stopped \
  --name nvidia-smi-rest \
  --gpus 'all,"capabilities=utility"' \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  ...
  --cpus 1 \
  --memory 1g \
  --memory-swap 1.5g \
  mbentley/nvidia-smi-rest

This is due to the way GPU injection currently happens from within a runc hook when the --gpus flag is used. The hook manually sets up the cgroups for the NVIDIA devices behind the back of docker/containerd/runc -- so when a systemd daemon-reload happens, the cgroup access for these devices gets undone (because these runtimes had no way of telling systemd that the hook had injected these devices, and the reload triggers systemd to reevaluate all cgroup rules).

This issue only started to be noticed by most people recently because the latest release of docker flipped to using systemd cgroup management by default (as opposed to cgroupfs).

The good news is, once CDI support is added to docker, this won't be necessary anymore. docker/cli#3864

Hi @klueska @elezar , what's the suggested equivalent of the docker --devices flags for Kubernetes GPU pods using containerd?

I added pass-device-specs and created the symlinks, but it didn't work for me. I am not sure how we can pass the --device flags in a Pod spec. So does it mean this is an acknowledged issue for Kubernetes GPU workloads using systemd cgroup management?

Update: tried runc 1.1.7 with systemd 245, but it didn't solve the issue.

@gaopeiliang as per #1671 (comment), when using systemd cgroup management (and newer systemd versions) it is required to pass the device nodes when launching a container. This is a separate issue from the runc bug that was fixed or for which the /dev/char symlinks were a workaround.

Hmm... our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we'll have to test that.

Another question: what does the no-cgroups = bool option in the config file /etc/nvidia-container-runtime/config.toml mean? Is there any spec or link about it?

Can we use pass-device-specs + no-cgroups = true + systemd to avoid the device manager problem? @klueska @elezar

Hmm... our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we'll have to test that.

Which version are you using?

The no-cgroups option is used to control whether the NVIDIA Container Library should update the cgroups for a container to allow access to a device. For the rootless case, where a user does not have permissions to manage cgroups, this must be disabled. I don't have enough experience to know whether your proposed combination would work as expected.
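(For reference, this option sits under the [nvidia-container-cli] section of that file; a minimal fragment, assuming the default file location:)

```toml
# /etc/nvidia-container-runtime/config.toml (fragment)
[nvidia-container-cli]
# Skip cgroup updates entirely (needed for rootless; use with care otherwise).
no-cgroups = true
```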

@didovesei with regards to:

I added pass-device-specs and created the symlinks, but it didn't work for me. I am not sure how we can pass the --device flags in a Pod spec. So does it mean this is an acknowledged issue for Kubernetes GPU workloads using systemd cgroup management?

How did you add the pass-device-specs option? This is an option typically set as an environment variable for the GPU device plugin. Which version of the plugin are you using?
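(For example, when deploying the plugin via a static manifest, the env-var form would look something like this hypothetical fragment; the container name is an assumption:)

```yaml
# Hypothetical device-plugin container fragment: PASS_DEVICE_SPECS as an env var.
containers:
- name: nvidia-device-plugin-ctr
  image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
  env:
  - name: PASS_DEVICE_SPECS
    value: "true"
  securityContext:
    privileged: true   # needed so the plugin can see the device nodes
```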

@didovesei with regards to:

I added pass-device-specs and created the symlinks, but it didn't work for me. I am not sure how we can pass the --device flags in a Pod spec. So does it mean this is an acknowledged issue for Kubernetes GPU workloads using systemd cgroup management?

How did you add the pass-device-specs option? This is an option typically set as an environment variable for the GPU device plugin. Which version of the plugin are you using?

Hi @elezar , I was using device plugin v0.10.0 + containerd 1.6.0 + systemd 245 + runc 1.1.7. I passed pass-device-specs in the device plugin args.

  containers:
  - args:
    - --fail-on-init-error=false
    - --mig-strategy=mixed
    - --pass-device-specs=true

I think the flag was taking effect (although not working), since now when I run nvidia-smi in the GPU Pod after a daemon-reload, it shows the below message instead of the NVML error.

root@gpu:/# nvidia-smi
No devices were found

I might have been a bit unclear in my last comment, but my real point is that in @klueska's comment it was mentioned that

Note: this does not address the issue where you still need to explicitly pass the device nodes for /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl on the command line (that won’t be fixed until CDI support is added to docker).

This fixes the issue where, even if you do explicitly pass the device nodes, you STILL lose access to the GPUs on a systemctl daemon-reload.

AFAIU, however, in the K8s context the devices should be passed into the Pod through the device plugin, so we shouldn't be expecting the user to explicitly pass the /dev nodes into the Pod. Besides, I am not sure if there is an equivalent of the docker --device flags in a K8s Pod spec. So I was wondering, given all the above points, does it mean that this is an acknowledged limitation of the NVIDIA K8s solution for a certain combination of configurations (like containerd+systemd+cgroup v1)?

@didovesei was the plugin running as a privileged container? This is required to pass the device nodes.

@didovesei was the plugin running as a privileged container? This is required to pass the device nodes.

@elezar It's not in privileged mode. I have been using a config similar to this one for the DP.

Is privileged mode a requirement specific to this issue, or does NVIDIA suggest using it for the DP in general?

See https://github.com/NVIDIA/k8s-device-plugin#setting-other-helm-chart-values (which needs an update for a discussion of the options and setting up privileged mode). Privileged mode is required when passing the device specs so that the device plugin can see all the required device nodes. Otherwise it would not have the required access (even though this is also provided by the nvidia container toolkit).

See https://github.com/NVIDIA/k8s-device-plugin#setting-other-helm-chart-values (which needs an update for a discussion of the options and setting up privileged mode). Privileged mode is required when passing the device specs so that the device plugin can see all the required device nodes. Otherwise it would not have the required access (even though this is also provided by the nvidia container toolkit).

Using privileged mode for the DP didn't work. But using privileged mode for the user workload Pod did. Also, it seems that as long as the user workload Pod is privileged, there aren't any problems -- the DP doesn't need to be privileged, and no symlinks for the char devices need to be created.

That is true, but most users don't want to run their user pods as privileged (and they shouldn't have to if everything else is set up properly).

Hmm... our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we'll have to test that.

Which version are you using?

The no-cgroups option is used to control whether the NVIDIA Container Library should update the cgroups for a container to allow access to a device. For the rootless case, where a user does not have permissions to manage cgroups, this must be disabled. I don't have enough experience to know whether your proposed combination would work as expected.

device-plugin version 1.0.0-beta

runc will also write to the cgroup fs if it has a device list; so pass-device + no-cgroups=true always succeeded when I tested it.

Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: #1730

Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: #1730

Thanks @breakingflower, that's very useful.

FYI: From the Notice:

Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).

That does sound very promising, but unfortunately it doesn't solve the issue.

I can confirm that using the new version of GPU Operator resolves the issue when CDI is enabled in gpu-operator config:

  cdi:
    enabled: true
    default: true

However, I am facing an issue where the nvidia-container-toolkit-daemonset can't start properly after a reboot of the machine:

  Warning  Failed          4m34s (x4 over 6m10s)  kubelet          Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=all: unknown

Any update on this?

Please see this notice from February:
#1730

@klueska

Please see this notice from February: #1730
I have read it in detail. Could you explain how to get the correct {{NVIDIA_DRIVER_ROOT}} in cases where the driver container is also in use?
I am not clear on this; the default value in nvidia-ctk is /

pcanas commented

Is there any timeline for a solution besides the workarounds described in #1730?

I tried the approach suggested in #6380, but it didn't solve the problem. It is quite frustrating, as I cannot rely on AKS at the moment. I hope this issue is solved soon.

@rogelioamancisidor we've heard that AKS ships with a really old version of the k8s-device-plugin (from 2019!) which doesn't support the PASS_DEVICE_SPECS flag. You will need to update the plugin to a newer one and pass this flag for things to work on AKS.

@klueska Here is the plugin that was suggested to me in the other discussion: plugin. Do you have a link for a newer k8s-device-plugin? I'd really appreciate it, as I have tried different things without any luck.

elezar commented

@klueska Here is the plugin that was suggested to me in the other discussion: plugin, and I just noticed, as you mentioned, that the plugin dates from 2019. Do you have a link for a newer k8s-device-plugin? I'd really appreciate it, as I have tried different things without any luck.

The plugin is available here: https://github.com/NVIDIA/k8s-device-plugin; the README should cover a variety of deployment options, with helm being the recommended one.

The latest version of the plugin is v0.14.1.

I deployed a DaemonSet for the NVIDIA device plugin using the yaml manifest in the link that I posted. The manifest in the link includes the line - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1. Isn't that manifest deploying the latest version then? PASS_DEVICE_SPECS is also set to true, as suggested by AKS.

homjay commented

Here is the official solution:

#1730 (comment)

modify /etc/docker/docker.json

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

it is working.

modify /etc/docker/docker.json

Isn't it /etc/docker/daemon.json?

@homjay I don't think that solution works on K8s.

elezar commented

This is an issue as described in NVIDIA/nvidia-container-toolkit#48

Since this issue has a number of different failure modes discussed, I'm going to close this issue and ask that those still having a problem open new issues in the respective repositories.

We are looking to migrate all issues in this repo to https://github.com/NVIDIA/nvidia-container-toolkit in the near term.