NVIDIA/nvidia-container-runtime

Permission denied with sudo command

Choiuijin1125 opened this issue · 5 comments

1. Issue or feature description

When I run nvidia-container-cli commands (info, list) with sudo, I get the error below,
so I can't use nvidia-docker, e.g. docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

2. Steps to reproduce the issue

Not working:

sudo nvidia-container-cli list

nvidia-container-cli: initialization error: open failed: /proc/sys/kernel/overflowuid: permission denied

Working:

nvidia-container-cli list

/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/dev/nvidia1
/dev/nvidia2
/dev/nvidia3
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.455.23.05
/usr/lib/x86_64-linux-gnu/libcuda.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.455.23.05
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvoptix.so.455.23.05
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.455.23.05
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.455.23.05
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.455.23.05
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.455.23.05
/usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.455.23.05
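
The failing call is an open of /proc/sys/kernel/overflowuid, so one way to narrow this down is to try reading that file directly, once as the normal user and once under sudo. The following is a minimal sketch and is not taken from the toolkit; the helper name can_read is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: try to actually open and read a file, then report the
# result. A real read is used instead of `[ -r ]`, because the permission-bit
# check can succeed for root even when the open itself is refused.
can_read() {
    if cat "$1" > /dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "FAILED: $1"
    fi
}

# The path comes from the error message in this issue.
can_read /proc/sys/kernel/overflowuid
```

Saving this as a script and running it both directly and through sudo shows whether the two contexts differ. A successful read under sudo does not rule out the problem, since nvidia-container-cli may open the file under different credentials or namespaces.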

Could you give more information on your setup including distribution and NVIDIA Container Toolkit package versions?

This sounds like an issue with user-namespaces independent of the nvidia-container-runtime. What OS are you running on and what peculiarities might you have configured on your system beyond a stock distribution?
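
One way to check the user-namespace theory is to look at the namespace and UID mapping a process sees. This is a minimal sketch, assuming the standard Linux procfs entries /proc/self/ns/user and /proc/self/uid_map; the helper name is_identity_uid_map is hypothetical. In the initial user namespace the map is normally the single line `0 0 4294967295`.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a uid_map file holds exactly the identity
# mapping "0 0 4294967295", which is what the initial user namespace shows.
is_identity_uid_map() {
    map_file=${1:-/proc/self/uid_map}
    # The kernel pads the columns with spaces, so normalise before comparing.
    [ "$(awk '{print $1, $2, $3}' "$map_file")" = "0 0 4294967295" ]
}

# Only report when a uid_map is available (i.e. on Linux with procfs mounted).
if [ -r /proc/self/uid_map ]; then
    echo "user namespace: $(readlink /proc/self/ns/user)"
    if is_identity_uid_map; then
        echo "uid_map: identity mapping (initial user namespace)"
    else
        echo "uid_map: remapped, this process is inside a user namespace"
    fi
fi
```

Running the script once directly and once through sudo, then comparing the two outputs, would show whether sudo ends up in a different user namespace than the login shell.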

This is the server information:

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --

I0323 07:31:52.052389 30761 nvc.c:376] initializing library context (version=1.9.0, build=5e135c17d6dbae861ec343e9a8d3a0d2af758a4f)
I0323 07:31:52.052632 30761 nvc.c:350] using root /
I0323 07:31:52.052649 30761 nvc.c:351] using ldcache /etc/ld.so.cache
I0323 07:31:52.052658 30761 nvc.c:352] using unprivileged user 1000:1002
I0323 07:31:52.052714 30761 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0323 07:31:52.052897 30761 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0323 07:31:52.059423 30762 nvc.c:273] failed to set inheritable capabilities
W0323 07:31:52.059488 30762 nvc.c:274] skipping kernel modules load due to failure
I0323 07:31:52.060095 30763 rpc.c:71] starting driver rpc service
I0323 07:31:52.065595 30764 rpc.c:71] starting nvcgo rpc service
I0323 07:31:52.070177 30761 nvc_info.c:765] requesting driver information with ''
I0323 07:31:52.072263 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.455.23.05
I0323 07:31:52.072526 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.455.23.05
I0323 07:31:52.072683 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.455.23.05
I0323 07:31:52.073086 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.455.23.05
I0323 07:31:52.073480 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.23.05
I0323 07:31:52.073675 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.455.23.05
I0323 07:31:52.073789 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.455.23.05
I0323 07:31:52.073913 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.455.23.05
I0323 07:31:52.073973 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.455.23.05
I0323 07:31:52.074100 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.455.23.05
I0323 07:31:52.074318 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.455.23.05
I0323 07:31:52.074403 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.455.23.05
I0323 07:31:52.074492 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.455.23.05
I0323 07:31:52.074630 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.455.23.05
I0323 07:31:52.075154 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.455.23.05
I0323 07:31:52.075542 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.455.23.05
I0323 07:31:52.075770 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.455.23.05
I0323 07:31:52.075858 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.455.23.05
I0323 07:31:52.076018 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.455.23.05
I0323 07:31:52.076225 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.455.23.05
I0323 07:31:52.076309 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.455.23.05
I0323 07:31:52.076614 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.455.23.05
I0323 07:31:52.076949 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.455.23.05
I0323 07:31:52.077066 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.455.23.05
I0323 07:31:52.077133 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.455.23.05
I0323 07:31:52.077370 30761 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.455.23.05
W0323 07:31:52.077418 30761 nvc_info.c:398] missing library libnvidia-nscq.so
W0323 07:31:52.077427 30761 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0323 07:31:52.077439 30761 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0323 07:31:52.077444 30761 nvc_info.c:402] missing compat32 library libnvidia-ml.so
W0323 07:31:52.077456 30761 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0323 07:31:52.077468 30761 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0323 07:31:52.077477 30761 nvc_info.c:402] missing compat32 library libcuda.so
W0323 07:31:52.077488 30761 nvc_info.c:402] missing compat32 library libnvidia-opencl.so
W0323 07:31:52.077499 30761 nvc_info.c:402] missing compat32 library libnvidia-ptxjitcompiler.so
W0323 07:31:52.077510 30761 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0323 07:31:52.077524 30761 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0323 07:31:52.077532 30761 nvc_info.c:402] missing compat32 library libnvidia-compiler.so
W0323 07:31:52.077545 30761 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0323 07:31:52.077554 30761 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0323 07:31:52.077563 30761 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0323 07:31:52.077572 30761 nvc_info.c:402] missing compat32 library libnvidia-encode.so
W0323 07:31:52.077578 30761 nvc_info.c:402] missing compat32 library libnvidia-opticalflow.so
W0323 07:31:52.077587 30761 nvc_info.c:402] missing compat32 library libnvcuvid.so
W0323 07:31:52.077594 30761 nvc_info.c:402] missing compat32 library libnvidia-eglcore.so
W0323 07:31:52.077602 30761 nvc_info.c:402] missing compat32 library libnvidia-glcore.so
W0323 07:31:52.077609 30761 nvc_info.c:402] missing compat32 library libnvidia-tls.so
W0323 07:31:52.077617 30761 nvc_info.c:402] missing compat32 library libnvidia-glsi.so
W0323 07:31:52.077627 30761 nvc_info.c:402] missing compat32 library libnvidia-fbc.so
W0323 07:31:52.077639 30761 nvc_info.c:402] missing compat32 library libnvidia-ifr.so
W0323 07:31:52.077648 30761 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0323 07:31:52.077656 30761 nvc_info.c:402] missing compat32 library libnvoptix.so
W0323 07:31:52.077668 30761 nvc_info.c:402] missing compat32 library libGLX_nvidia.so
W0323 07:31:52.077678 30761 nvc_info.c:402] missing compat32 library libEGL_nvidia.so
W0323 07:31:52.077687 30761 nvc_info.c:402] missing compat32 library libGLESv2_nvidia.so
W0323 07:31:52.077698 30761 nvc_info.c:402] missing compat32 library libGLESv1_CM_nvidia.so
W0323 07:31:52.077704 30761 nvc_info.c:402] missing compat32 library libnvidia-glvkspirv.so
W0323 07:31:52.077715 30761 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0323 07:31:52.078680 30761 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0323 07:31:52.078726 30761 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0323 07:31:52.078758 30761 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0323 07:31:52.078867 30761 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0323 07:31:52.078902 30761 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0323 07:31:52.078967 30761 nvc_info.c:424] missing binary nv-fabricmanager
W0323 07:31:52.079162 30761 nvc_info.c:348] missing firmware path /lib/firmware/nvidia/455.23.05/gsp.bin
I0323 07:31:52.079194 30761 nvc_info.c:528] listing device /dev/nvidiactl
I0323 07:31:52.079201 30761 nvc_info.c:528] listing device /dev/nvidia-uvm
I0323 07:31:52.079212 30761 nvc_info.c:528] listing device /dev/nvidia-uvm-tools
I0323 07:31:52.079222 30761 nvc_info.c:528] listing device /dev/nvidia-modeset
W0323 07:31:52.079260 30761 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket
W0323 07:31:52.079288 30761 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0323 07:31:52.079308 30761 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0323 07:31:52.079315 30761 nvc_info.c:821] requesting device information with ''
I0323 07:31:52.089701 30761 nvc_info.c:712] listing device /dev/nvidia0 (GPU-f26f1091-107b-7f7e-ccc2-c3a6c7da082c at 00000000:00:06.0)
I0323 07:31:52.105924 30761 nvc_info.c:712] listing device /dev/nvidia1 (GPU-73bd9d4d-1037-f06a-c1d5-00e580457554 at 00000000:00:07.0)
I0323 07:31:52.117356 30761 nvc_info.c:712] listing device /dev/nvidia2 (GPU-12127740-3804-2777-66a2-d51623c3d17b at 00000000:00:08.0)
I0323 07:31:52.129291 30761 nvc_info.c:712] listing device /dev/nvidia3 (GPU-d2cc3c0c-c504-a135-ec1a-e712af5b8380 at 00000000:00:09.0)
NVRM version:   455.23.05
CUDA version:   11.1

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-f26f1091-107b-7f7e-ccc2-c3a6c7da082c
Bus Location:   00000000:00:06.0
Architecture:   7.5

Device Index:   1
Device Minor:   1
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-73bd9d4d-1037-f06a-c1d5-00e580457554
Bus Location:   00000000:00:07.0
Architecture:   7.5

Device Index:   2
Device Minor:   2
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-12127740-3804-2777-66a2-d51623c3d17b
Bus Location:   00000000:00:08.0
Architecture:   7.5

Device Index:   3
Device Minor:   3
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-d2cc3c0c-c504-a135-ec1a-e712af5b8380
Bus Location:   00000000:00:09.0
Architecture:   7.5
I0323 07:31:52.129507 30761 nvc.c:430] shutting down library context
I0323 07:31:52.129662 30764 rpc.c:95] terminating nvcgo rpc service
I0323 07:31:52.130498 30761 rpc.c:135] nvcgo rpc service terminated successfully
I0323 07:31:52.134364 30763 rpc.c:95] terminating driver rpc service
I0323 07:31:52.134608 30761 rpc.c:135] driver rpc service terminated successfully
  • Kernel version from uname -a
Linux gpu-1 4.15.0-159-generic #167-Ubuntu SMP Tue Sep 21 08:55:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

  • Docker version from docker version
Client: Docker Engine - Community
 Version:           20.10.12
 API version:       1.41
 Go version:        go1.16.12
 Git commit:        e91ed57
 Built:             Mon Dec 13 11:45:27 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:43:36 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
un  libgldispatch0-nvidia                                 <none>                          <none>                          (no description available)
rc  libnvidia-compute-450:amd64                           450.142.00-0ubuntu1             amd64                           NVIDIA libcompute package
ii  libnvidia-container-tools                             1.9.0-1                         amd64                           NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                            1.9.0-1                         amd64                           NVIDIA container runtime library
un  libnvidia-ml1                                         <none>                          <none>                          (no description available)
un  nvidia-304                                            <none>                          <none>                          (no description available)
un  nvidia-340                                            <none>                          <none>                          (no description available)
un  nvidia-384                                            <none>                          <none>                          (no description available)
un  nvidia-common                                         <none>                          <none>                          (no description available)
un  nvidia-container-runtime                              <none>                          <none>                          (no description available)
un  nvidia-container-runtime-hook                         <none>                          <none>                          (no description available)
ii  nvidia-container-toolkit                              1.9.0-1                         amd64                           NVIDIA container runtime hook
un  nvidia-docker                                         <none>                          <none>                          (no description available)
ii  nvidia-docker2                                        2.10.0-1                        all                             nvidia-docker CLI wrapper
un  nvidia-opencl-icd                                     <none>                          <none>                          (no description available)
un  nvidia-prime                                          <none>                          <none>                          (no description available)
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.9.0
lib-version: 1.9.0
build date: 2022-03-18T13:46+00:00
build revision: 5e135c17d6dbae861ec343e9a8d3a0d2af758a4f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
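
The debug log above prefixes each line with a severity letter (I for info, W for warning), and most of the W lines are repetitive "missing ..." notices. The small filter sketched here hides those so the remaining warnings, such as "failed to set inheritable capabilities", are easier to spot. The helper name filter_warnings is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read an nvidia-container-cli debug log on stdin and
# print only warning lines (starting with "W") that are not "missing ..."
# notices.
filter_warnings() {
    awk '/^W/ && !/ missing /'
}

# Example using three lines copied from the log in this issue.
filter_warnings <<'EOF'
I0323 07:31:52.052658 30761 nvc.c:352] using unprivileged user 1000:1002
W0323 07:31:52.059423 30762 nvc.c:273] failed to set inheritable capabilities
W0323 07:31:52.077418 30761 nvc_info.c:398] missing library libnvidia-nscq.so
EOF
```

On the sample input only the "failed to set inheritable capabilities" line is printed. To apply it to a full log, the output of the debug command can be written to a file by passing a file path to -d instead of /dev/tty, then fed to filter_warnings on stdin.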

@klueska I also suspect a user-namespace issue. I'm using Ubuntu 18.04 and have only changed the docker group permission, as below:

sudo usermod -aG docker $USER

One thing I suspect is that I didn't reboot the server after installing nvidia-docker2.
I'll reboot and test again soon.
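
As a side note on the usermod step: group changes made with usermod -aG apply only to sessions started after the change, so logging out and back in is normally enough for that part. A minimal sketch for checking whether the current session already carries the docker group follows; the helper name has_group is hypothetical, and this check is separate from the sudo error itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the space-separated group list in $1
# contains the group named in $2 as a whole word.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# `id -nG` without a user argument reports the groups of the current process,
# which reflects the login session rather than the group database.
if has_group "$(id -nG)" docker; then
    echo "docker group is active in this session"
else
    echo "docker group is not active yet; log out and back in, or run: newgrp docker"
fi
```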

elezar commented

Please see NVIDIA/nvidia-container-toolkit#102 which presents similar behaviour and a possible resolution.