Permissions of nvidia-container-runtime with podman not working
Ru13en opened this issue · 5 comments
1. Issue or feature description
After every system boot/reboot, rootless podman does not work with the NVIDIA plugin.
I must first run nvidia-smi, otherwise I get the error:
Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime error
After that I also need to run the NVIDIA Device Node Verification script to properly set up /dev/nvidia-uvm for CUDA applications, as described in this post:
tensorflow/tensorflow#32623 (comment)
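For reference, the part of that script that matters here is roughly the following (a paraphrase of the startup-script example in the CUDA installation guide; the exact script may differ):

#!/bin/bash
# Load the nvidia-uvm kernel module, then create the /dev/nvidia-uvm device node,
# which nvidia-smi does not create on its own.
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
  # Find the major device number assigned to nvidia-uvm and create the node
  D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
  mknod -m 666 /dev/nvidia-uvm c "$D" 0
else
  exit 1
fi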
2. Steps to reproduce the issue
Install CentOS 8 with SELinux enabled + the NVIDIA Linux drivers.
Install podman and nvidia-container-runtime
Configure /etc/nvidia-container-runtime/config.toml (as attached)
Reboot the machine
Run the commands (they will fail after each reboot unless you first run nvidia-smi and the nvidia-device-node-verification script):
podman run --privileged -it nvidia/cuda:11.3.1-base-centos8 nvidia-smi
podman run --privileged -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
Run the commands (it will work):
nvidia-smi
podman run --privileged -it nvidia/cuda:11.3.1-base-centos8 nvidia-smi
sh nvidia-device-node-verification.sh #(from https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-verifications)
podman run --privileged -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
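Before running nvidia-smi, it can help to check what is actually missing right after the reboot; a quick check (standard commands, nothing NVIDIA-specific assumed):

# Which NVIDIA kernel modules are loaded? nvidia-uvm is typically the missing one.
lsmod | grep nvidia
# Which device nodes exist? /dev/nvidia-uvm and /dev/nvidia-uvm-tools are the ones
# nvidia-smi alone does not create.
ls -l /dev/nvidia*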
3. Information to attach (optional if deemed irrelevant)
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
CentOS Linux release 8.3.2011
getenforce:
Enforcing
podman info:
arch: amd64
buildahVersion: 1.20.1
cgroupManager: cgroupfs
cgroupVersion: v1
conmon:
  package: conmon-2.0.27-1.el8.1.5.x86_64
  path: /usr/bin/conmon
  version: 'conmon version 2.0.27, commit: '
cpus: 80
distribution:
  distribution: '"centos"'
  version: "8"
eventLogger: journald
hostname: turing
idMappings:
  gidmap:
  - container_id: 0
    host_id: 2002
    size: 1
  - container_id: 1
    host_id: 100000
    size: 65536
  uidmap:
  - container_id: 0
    host_id: 2002
    size: 1
  - container_id: 1
    host_id: 100000
    size: 65536
kernel: 4.18.0-240.22.1.el8_3.x86_64
linkmode: dynamic
memFree: 781801324544
memTotal: 809933586432
ociRuntime:
  name: crun
  package: crun-0.19.1-2.el8.3.1.x86_64
  path: /usr/bin/crun
  version: |-
    crun version 0.19.1
    commit: 1535fedf0b83fb898d449f9680000f729ba719f5
    spec: 1.0.0
    +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +YAJL
os: linux
remoteSocket:
  path: /run/user/2002/podman/podman.sock
security:
  apparmorEnabled: false
  capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
  rootless: true
  seccompEnabled: true
  selinuxEnabled: true
slirp4netns:
  executable: /usr/bin/slirp4netns
  package: slirp4netns-1.1.8-4.el8.7.6.x86_64
  version: |-
    slirp4netns version 1.1.8
    commit: d361001f495417b880f20329121e3aa431a8f90f
    libslirp: 4.3.1
    SLIRP_CONFIG_VERSION_MAX: 3
    libseccomp: 2.4.3
swapFree: 42949668864
swapTotal: 42949668864
uptime: 29h 16m 48.14s (Approximately 1.21 days)
registries:
  search:
  - docker.io
  - quay.io
store:
  configFile: /home/user/.config/containers/storage.conf
  containerStore:
    number: 29
    paused: 0
    running: 0
    stopped: 29
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: fuse-overlayfs-1.5.0-1.el8.5.3.x86_64
      Version: |-
        fusermount3 version: 3.2.1
        fuse-overlayfs: version 1.5
        FUSE library version 3.2.1
        using FUSE kernel interface version 7.26
  graphRoot: /home/user/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 28
  runRoot: /run/user/2002/containers
  volumePath: /home/user/.local/share/containers/storage/volumes
version:
  APIVersion: 3.1.2
  Built: 1619185402
  BuiltTime: Fri Apr 23 14:43:22 2021
  GitCommit: ""
  GoVersion: go1.14.12
  OsArch: linux/amd64
  Version: 3.1.2
nvidia-smi | grep Version
NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3
cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false
[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = true
#user = "root:video"
ldconfig = "@/sbin/ldconfig"
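The no-cgroups = true line above is the setting that usually matters for rootless podman, since an unprivileged hook cannot manage device cgroups owned by root. Assuming the packaged default config still ships it as a commented #no-cgroups = false line (an assumption about the package, not taken from this issue), it can be enabled with something like:

# Flip the commented default to no-cgroups = true for rootless use
sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml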
Thanks for creating the new issue @Ru13en
Here I would assume that the kernel modules cannot be loaded by the NVIDIA container runtime hook, which also prevents the device nodes from being created. nvidia-smi ends up loading the kernel modules and creating the device nodes, but it seems to skip the creation of nvidia-uvm and nvidia-uvm-tools -- which is what the "Device Node Verification" script that you mentioned handles.
Is it possible to run the script on startup of the system?
@elezar Yes, I fixed it by creating a script that runs both commands at startup. However, it is not a user-friendly approach...
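For anyone else hitting this, a minimal sketch of such a startup script, assuming it is saved as /usr/local/sbin/nvidia-boot-setup.sh and wired into a systemd oneshot unit or rc.local (the path and name here are made up for illustration, not what was actually used):

#!/bin/bash
# Load the NVIDIA kernel modules and create /dev/nvidia0, /dev/nvidiactl, ...
/usr/bin/nvidia-smi > /dev/null
# Load nvidia-uvm and create /dev/nvidia-uvm, which nvidia-smi skips
sh /usr/local/sbin/nvidia-device-node-verification.sh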
I don't know whether there is a way around this for rootless podman (I would have to check), but I would expect this to work in the rootful case since the NVIDIA Container Toolkit DOES load the kernel modules and create the device nodes on the host as part of creating the container. Could you uncomment the debug option in the toolkit config (#debug = "/var/log/nvidia-container-toolkit.log") and attach the contents of the file when launching a rootful container that fails?
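For example, the option can be enabled by hand or with a one-liner along these lines (just a sketch; it matches the commented line in the config attached above):

# Uncomment the debug option so the toolkit writes a log on every container start
sudo sed -i 's|^#debug = "/var/log/nvidia-container-toolkit.log"|debug = "/var/log/nvidia-container-toolkit.log"|' /etc/nvidia-container-runtime/config.toml
# After reproducing the failure, attach this file
cat /var/log/nvidia-container-toolkit.log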
Testing with:
podman run --privileged -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
@elezar For some reason I can no longer replicate the issue for rootful runs, but the behavior continues on rootless (maybe it was fixed by some update, since I made the previous post in May).
For rootless, unless the root user starts a container first, it will trigger:
Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime error
If I run the command with sudo and then again without it, it runs normally (the NVIDIA Container Toolkit loads the kernel modules and creates the device nodes).
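In other words, the sequence that works after a reboot looks like this (same image as in the steps above):

# Fails right after a reboot (rootless): the hook finds no modules or device nodes
podman run --privileged -it nvidia/cuda:11.3.1-base-centos8 nvidia-smi
# Works: the rootful run lets the toolkit load the modules and create the device nodes
sudo podman run --privileged -it nvidia/cuda:11.3.1-base-centos8 nvidia-smi
# Now the rootless run succeeds as well
podman run --privileged -it nvidia/cuda:11.3.1-base-centos8 nvidia-smi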
Please see the updated instructions for running the NVIDIA Container Runtime with Podman.
If you're still having problems, please open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit.