NVIDIA/nvidia-docker

nvidia-container-cli: mount error: file creation failed: xxx/merged/run/nvidia-persistenced/socket: no such device or address: unknown.

WulixuanS opened this issue · 6 comments

1. Issue or feature description

nvidia-docker fails with a mount error when persistence mode is enabled

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v /var/run:/var/run -it debian bash
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #1:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: file creation failed: /data/docker/overlay2/8a72121bd999b74d25be8c84cc2e4951dde8427dbcbed9f3efbc6782950f6233/merged/run/nvidia-persistenced/socket: no such device or address: unknown.
ERRO[0001] error waiting for container: context canceled

2. Steps to reproduce the issue

  1. NVIDIA persistence mode is on (the nvidia-persistenced daemon is running, so /run/nvidia-persistenced/socket exists on the host)
  2. docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v /var/run:/var/run -it debian bash
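The "no such device or address" in the error above is errno ENXIO, which is what open(2) returns when asked to open/create a path that is already a Unix-domain socket. Because -v /var/run:/var/run bind-mounts the host's /run into the container, /run/nvidia-persistenced/socket already exists there as a socket when the hook tries to create it. A minimal sketch of that failure mode, outside Docker (hypothetical paths, standard library only):

```python
import errno
import os
import socket
import tempfile

# Create a Unix-domain socket file, as nvidia-persistenced does for
# /run/nvidia-persistenced/socket on the host.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "socket")
sock = socket.socket(socket.AF_UNIX)
sock.bind(path)

# Trying to open()/create a regular file at that path fails with ENXIO
# ("No such device or address") -- the same error the hook reports when
# the socket is already present under the bind-mounted /var/run.
err = None
try:
    os.open(path, os.O_CREAT | os.O_WRONLY)
except OSError as e:
    err = e.errno

print(errno.errorcode[err])  # ENXIO
sock.close()
```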

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
I1011 02:47:03.967588 29866 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I1011 02:47:03.967631 29866 nvc.c:350] using root /
I1011 02:47:03.967638 29866 nvc.c:351] using ldcache /etc/ld.so.cache
I1011 02:47:03.967643 29866 nvc.c:352] using unprivileged user 65534:65534
I1011 02:47:03.967662 29866 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1011 02:47:03.967730 29866 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1011 02:47:03.971088 29867 nvc.c:278] loading kernel module nvidia
I1011 02:47:03.971202 29867 nvc.c:282] running mknod for /dev/nvidiactl
I1011 02:47:03.971233 29867 nvc.c:286] running mknod for /dev/nvidia0
I1011 02:47:03.971254 29867 nvc.c:286] running mknod for /dev/nvidia1
I1011 02:47:03.971273 29867 nvc.c:286] running mknod for /dev/nvidia2
I1011 02:47:03.971292 29867 nvc.c:286] running mknod for /dev/nvidia3
I1011 02:47:03.971312 29867 nvc.c:286] running mknod for /dev/nvidia4
I1011 02:47:03.971330 29867 nvc.c:286] running mknod for /dev/nvidia5
I1011 02:47:03.971348 29867 nvc.c:286] running mknod for /dev/nvidia6
I1011 02:47:03.971367 29867 nvc.c:286] running mknod for /dev/nvidia7
I1011 02:47:03.971385 29867 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1011 02:47:03.976688 29867 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I1011 02:47:03.976772 29867 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I1011 02:47:03.980977 29867 nvc.c:296] loading kernel module nvidia_uvm
I1011 02:47:03.981009 29867 nvc.c:300] running mknod for /dev/nvidia-uvm
I1011 02:47:03.981058 29867 nvc.c:305] loading kernel module nvidia_modeset
I1011 02:47:03.981094 29867 nvc.c:309] running mknod for /dev/nvidia-modeset
I1011 02:47:03.981356 29868 rpc.c:71] starting driver rpc service
I1011 02:47:03.991209 29869 rpc.c:71] starting nvcgo rpc service
I1011 02:47:03.992027 29866 nvc_info.c:766] requesting driver information with ''
I1011 02:47:03.993218 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.470.129.06
I1011 02:47:03.993319 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.129.06
I1011 02:47:03.993369 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.129.06
I1011 02:47:03.993400 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.129.06
I1011 02:47:03.993429 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.129.06
I1011 02:47:03.993470 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.129.06
I1011 02:47:03.993511 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.129.06
I1011 02:47:03.993543 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.129.06
I1011 02:47:03.993571 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.129.06
I1011 02:47:03.993612 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.129.06
I1011 02:47:03.993652 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.129.06
I1011 02:47:03.993680 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.129.06
I1011 02:47:03.993707 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.129.06
I1011 02:47:03.993737 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.129.06
I1011 02:47:03.993775 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.129.06
I1011 02:47:03.993814 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.129.06
I1011 02:47:03.993843 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.129.06
I1011 02:47:03.993872 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.129.06
I1011 02:47:03.993912 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.129.06
I1011 02:47:03.993940 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.129.06
I1011 02:47:03.993979 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.129.06
I1011 02:47:03.994116 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.129.06
I1011 02:47:03.994210 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.129.06
I1011 02:47:03.994241 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.129.06
I1011 02:47:03.994271 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.129.06
I1011 02:47:03.994303 29866 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.129.06
W1011 02:47:03.994326 29866 nvc_info.c:399] missing library libnvidia-nscq.so
W1011 02:47:03.994332 29866 nvc_info.c:399] missing library libcudadebugger.so
W1011 02:47:03.994337 29866 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1011 02:47:03.994346 29866 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1011 02:47:03.994351 29866 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W1011 02:47:03.994358 29866 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1011 02:47:03.994363 29866 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1011 02:47:03.994369 29866 nvc_info.c:403] missing compat32 library libcuda.so
W1011 02:47:03.994376 29866 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1011 02:47:03.994381 29866 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W1011 02:47:03.994388 29866 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W1011 02:47:03.994393 29866 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1011 02:47:03.994399 29866 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1011 02:47:03.994406 29866 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W1011 02:47:03.994413 29866 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1011 02:47:03.994420 29866 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1011 02:47:03.994428 29866 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1011 02:47:03.994434 29866 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W1011 02:47:03.994440 29866 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W1011 02:47:03.994446 29866 nvc_info.c:403] missing compat32 library libnvcuvid.so
W1011 02:47:03.994453 29866 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W1011 02:47:03.994459 29866 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W1011 02:47:03.994464 29866 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W1011 02:47:03.994473 29866 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W1011 02:47:03.994480 29866 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W1011 02:47:03.994486 29866 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1011 02:47:03.994492 29866 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1011 02:47:03.994500 29866 nvc_info.c:403] missing compat32 library libnvoptix.so
W1011 02:47:03.994506 29866 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W1011 02:47:03.994512 29866 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W1011 02:47:03.994520 29866 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W1011 02:47:03.994525 29866 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W1011 02:47:03.994533 29866 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W1011 02:47:03.994537 29866 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1011 02:47:03.995419 29866 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1011 02:47:03.995435 29866 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1011 02:47:03.995451 29866 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1011 02:47:03.995475 29866 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1011 02:47:03.995491 29866 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1011 02:47:03.995540 29866 nvc_info.c:425] missing binary nv-fabricmanager
I1011 02:47:03.995564 29866 nvc_info.c:343] listing firmware path /lib/firmware/nvidia/470.129.06/gsp.bin
I1011 02:47:03.995586 29866 nvc_info.c:529] listing device /dev/nvidiactl
I1011 02:47:03.995592 29866 nvc_info.c:529] listing device /dev/nvidia-uvm
I1011 02:47:03.995598 29866 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1011 02:47:03.995606 29866 nvc_info.c:529] listing device /dev/nvidia-modeset
I1011 02:47:03.995631 29866 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W1011 02:47:03.995649 29866 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1011 02:47:03.995663 29866 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1011 02:47:03.995671 29866 nvc_info.c:822] requesting device information with ''
I1011 02:47:04.001966 29866 nvc_info.c:713] listing device /dev/nvidia0 (GPU-3edc0d81-2467-bfe5-ba05-f932ffbbd171 at 00000000:18:00.0)
I1011 02:47:04.008266 29866 nvc_info.c:713] listing device /dev/nvidia1 (GPU-81ccb748-25c7-6baa-5375-f5a8dead8275 at 00000000:3b:00.0)
I1011 02:47:04.014675 29866 nvc_info.c:713] listing device /dev/nvidia2 (GPU-478ac85a-738a-e7db-3a72-6e9146635e3f at 00000000:5e:00.0)
I1011 02:47:04.021147 29866 nvc_info.c:713] listing device /dev/nvidia3 (GPU-40492895-e696-62ba-d470-7aab3444f75f at 00000000:5f:00.0)
I1011 02:47:04.027766 29866 nvc_info.c:713] listing device /dev/nvidia4 (GPU-c0a5ec57-f0ed-6f06-fd10-e9377897f593 at 00000000:86:00.0)
I1011 02:47:04.034507 29866 nvc_info.c:713] listing device /dev/nvidia5 (GPU-5a48523d-b55e-5b4e-64fd-9d9546f8d81e at 00000000:87:00.0)
I1011 02:47:04.041346 29866 nvc_info.c:713] listing device /dev/nvidia6 (GPU-c46d7cf0-ae45-d572-1945-b542f1f3d6ec at 00000000:af:00.0)
I1011 02:47:04.048295 29866 nvc_info.c:713] listing device /dev/nvidia7 (GPU-cae80e67-3b91-d352-b380-b52468a7abc3 at 00000000:d8:00.0)
NVRM version:   470.129.06
CUDA version:   11.4

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-3edc0d81-2467-bfe5-ba05-f932ffbbd171
Bus Location:   00000000:18:00.0
Architecture:   7.5

Device Index:   1
Device Minor:   1
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-81ccb748-25c7-6baa-5375-f5a8dead8275
Bus Location:   00000000:3b:00.0
Architecture:   7.5

Device Index:   2
Device Minor:   2
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-478ac85a-738a-e7db-3a72-6e9146635e3f
Bus Location:   00000000:5e:00.0
Architecture:   7.5

Device Index:   3
Device Minor:   3
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-40492895-e696-62ba-d470-7aab3444f75f
Bus Location:   00000000:5f:00.0
Architecture:   7.5

Device Index:   4
Device Minor:   4
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-c0a5ec57-f0ed-6f06-fd10-e9377897f593
Bus Location:   00000000:86:00.0
Architecture:   7.5

Device Index:   5
Device Minor:   5
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-5a48523d-b55e-5b4e-64fd-9d9546f8d81e
Bus Location:   00000000:87:00.0
Architecture:   7.5

Device Index:   6
Device Minor:   6
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-c46d7cf0-ae45-d572-1945-b542f1f3d6ec
Bus Location:   00000000:af:00.0
Architecture:   7.5

Device Index:   7
Device Minor:   7
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-cae80e67-3b91-d352-b380-b52468a7abc3
Bus Location:   00000000:d8:00.0
Architecture:   7.5
I1011 02:47:04.048434 29866 nvc.c:434] shutting down library context
I1011 02:47:04.048460 29869 rpc.c:95] terminating nvcgo rpc service
I1011 02:47:04.048936 29866 rpc.c:135] nvcgo rpc service terminated successfully
I1011 02:47:04.050552 29868 rpc.c:95] terminating driver rpc service
I1011 02:47:04.050681 29866 rpc.c:135] driver rpc service terminated successfully
  • Kernel version from uname -a
   5.10.0-72 #1b2c9d41a474 SMP Tue Oct 12 15:45:46 HKT 2021 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
    /
  • Driver information from nvidia-smi -a
   /
  • Docker version from docker version
Client: Docker Engine - Community
 Version:           19.03.15
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        99e3ed8919
 Built:             Sat Jan 30 03:17:05 2021
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.15
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       99e3ed8919
  Built:            Sat Jan 30 03:15:34 2021
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 nvidia:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                       Version                    Architecture               Description
+++-==========================================-==========================-==========================-=========================================================================================
ii  libnvidia-container-tools                  1.11.0-1                   amd64                      NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                 1.11.0-1                   amd64                      NVIDIA container runtime library
ii  nvidia-container-runtime                   3.11.0-1                   all                        NVIDIA container runtime
un  nvidia-container-runtime-hook              <none>                     <none>                     (no description available)
ii  nvidia-container-toolkit                   1.11.0-1                   amd64                      NVIDIA Container toolkit
ii  nvidia-container-toolkit-base              1.11.0-1                   amd64                      NVIDIA Container Toolkit Base
un  nvidia-docker                              <none>                     <none>                     (no description available)
ii  nvidia-docker2                             2.11.0-1                   all                        nvidia-docker CLI wrapper
un  nvidia-legacy-304xx-vdpau-driver           <none>                     <none>                     (no description available)
un  nvidia-legacy-340xx-vdpau-driver           <none>                     <none>                     (no description available)
un  nvidia-vdpau-driver                        <none>                     <none>                     (no description available)
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-6 6.3.0 20170516
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v /var/run:/var/run -it debian bash

Just a question: Does this work when you don't include the -v /var/run:/var/run flag?

There is no problem without the -v /var/run:/var/run flag. @elezar

https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon
Per those docs, there are two ways to enable persistence:

  1. Persistence Mode (legacy)
  2. Persistence Daemon

There is no problem when using the first (legacy) method.

I was able to reproduce this locally (by including the -v /var/run:/var/run flag when starting a Docker container).

I have updated the "mount" code in libnvidia-container to check whether a file already exists before creating it:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/187.
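In essence, that change is a check-before-create: if the target path already exists in the container filesystem (because the host's /var/run was bind-mounted in), it is reused as the mount point instead of being created, which sidesteps the ENXIO from trying to create a file over an existing socket. A rough sketch of the pattern; the actual patch is in C inside libnvidia-container, and `ensure_mount_target` is a hypothetical name:

```python
import os


def ensure_mount_target(path: str) -> str:
    """Make sure `path` exists so it can serve as a bind-mount target.

    Hypothetical sketch of the check-before-create pattern: only create
    a placeholder file if nothing is there yet; an existing entry (even
    a socket, which open() cannot create over) is simply reused.
    """
    if os.path.lexists(path):
        return path  # already present -- reuse it, do not open()/create
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Create an empty placeholder file to bind-mount over.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.close(fd)
    return path
```

Calling this on a path that the bind mount already populated is a no-op, while a fresh overlayfs path gets the placeholder file the bind mount needs.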

Are you in a position to test this patch?

Hi, we have made a version of the NVIDIA Container Toolkit (v1.12.0-rc.2) that includes this fix available in our experimental repositories. If possible, please install this version and confirm that it addresses your issue.

We have released v1.12.0 of the NVIDIA Container Toolkit including a fix for this. Please use this release and reopen this issue if you're still seeing issues.