ublue-os/hwe

GPUs in containers appear to be broken


The instructions for testing GPUs in containers don't appear to work. I ran:

podman run \
    --user 1000:1000 \
    --security-opt=no-new-privileges \
    --cap-drop=ALL \
    --security-opt label=type:nvidia_container_t \
    docker.io/nvidia/samples:vectoradd-cuda11.2.1

and get the following output:

Trying to pull docker.io/nvidia/samples:vectoradd-cuda11.2.1...
Getting image source signatures
Copying blob fe72fda9c19e done   |
Copying blob b3afe92c540b done   |
Copying blob ddb025f124b9 done   |
Copying blob b25f8d7adb24 done   |
Copying blob d519e2592276 done   |
Copying blob d22d2dfcfa9c done   |
Copying blob c88b7b7dd6ba done   |
Copying blob f2c9b54e36bc done   |
Copying blob 50333516d41c done   |
Copying config 02c32dc6d0 done   |
Writing manifest to image destination
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

nvidia-smi output:

Sun Oct 22 19:22:29 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0  On |                  Off |
|  0%   34C    P8              11W / 450W |    465MiB / 24564MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2295      G   /usr/bin/gnome-shell                        154MiB |
|    0   N/A  N/A      3495      G   /usr/lib64/firefox/firefox                  249MiB |
|    0   N/A  N/A      4459      G   /usr/bin/alacritty                           28MiB |
+---------------------------------------------------------------------------------------+

just --unstable nvidia-test-cuda:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

sudo nvidia-container-cli -k -d /dev/tty info:

-- WARNING, the following logs are for debugging purposes only --

I1023 02:24:09.451383 5247 nvc.c:376] initializing library context (version=1.14.3, build=1eb5a30a6ad0415550a9df632ac8832bf7e2bbba)
I1023 02:24:09.451465 5247 nvc.c:350] using root /
I1023 02:24:09.451470 5247 nvc.c:351] using ldcache /etc/ld.so.cache
I1023 02:24:09.451475 5247 nvc.c:352] using unprivileged user 65534:65534
I1023 02:24:09.451504 5247 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1023 02:24:09.451575 5247 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1023 02:24:09.464419 5248 nvc.c:278] loading kernel module nvidia
I1023 02:24:09.464518 5248 nvc.c:282] running mknod for /dev/nvidiactl
I1023 02:24:09.464550 5248 nvc.c:286] running mknod for /dev/nvidia0
I1023 02:24:09.464568 5248 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1023 02:24:09.469693 5248 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I1023 02:24:09.469798 5248 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I1023 02:24:09.471388 5248 nvc.c:296] loading kernel module nvidia_uvm
I1023 02:24:09.471433 5248 nvc.c:300] running mknod for /dev/nvidia-uvm
I1023 02:24:09.471527 5248 nvc.c:305] loading kernel module nvidia_modeset
I1023 02:24:09.471559 5248 nvc.c:309] running mknod for /dev/nvidia-modeset
I1023 02:24:09.471908 5249 rpc.c:71] starting driver rpc service
I1023 02:24:09.477201 5250 rpc.c:71] starting nvcgo rpc service
I1023 02:24:09.487458 5247 nvc_info.c:798] requesting driver information with ''
I1023 02:24:09.488257 5247 nvc_info.c:176] selecting /usr/lib64/libnvoptix.so.535.113.01
I1023 02:24:09.488289 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-tls.so.535.113.01
I1023 02:24:09.488310 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-rtcore.so.535.113.01
I1023 02:24:09.488346 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.535.113.01
I1023 02:24:09.488367 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-pkcs11-openssl3.so.535.113.01
I1023 02:24:09.488687 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-opticalflow.so.535.113.01
I1023 02:24:09.488742 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-opencl.so.535.113.01
I1023 02:24:09.488782 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-nvvm.so.535.113.01
I1023 02:24:09.488848 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-ngx.so.535.113.01
I1023 02:24:09.488880 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-ml.so.535.113.01
I1023 02:24:09.488923 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-glvkspirv.so.535.113.01
I1023 02:24:09.488944 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-glsi.so.535.113.01
I1023 02:24:09.488964 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-glcore.so.535.113.01
I1023 02:24:09.489006 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-fbc.so.535.113.01
I1023 02:24:09.489051 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-encode.so.535.113.01
I1023 02:24:09.489087 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-eglcore.so.535.113.01
I1023 02:24:09.489125 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-cfg.so.535.113.01
I1023 02:24:09.489158 5247 nvc_info.c:176] selecting /usr/lib64/libnvidia-allocator.so.535.113.01
I1023 02:24:09.489179 5247 nvc_info.c:176] selecting /usr/lib64/libnvcuvid.so.535.113.01
I1023 02:24:09.489312 5247 nvc_info.c:176] selecting /usr/lib64/libcudadebugger.so.535.113.01
I1023 02:24:09.489333 5247 nvc_info.c:176] selecting /usr/lib64/libcuda.so.535.113.01
I1023 02:24:09.489420 5247 nvc_info.c:176] selecting /usr/lib64/libGLX_nvidia.so.535.113.01
I1023 02:24:09.489453 5247 nvc_info.c:176] selecting /usr/lib64/libGLESv2_nvidia.so.535.113.01
I1023 02:24:09.489484 5247 nvc_info.c:176] selecting /usr/lib64/libGLESv1_CM_nvidia.so.535.113.01
I1023 02:24:09.489507 5247 nvc_info.c:176] selecting /usr/lib64/libEGL_nvidia.so.535.113.01
I1023 02:24:09.490482 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-tls.so.535.113.01
I1023 02:24:09.491316 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-ptxjitcompiler.so.535.113.01
I1023 02:24:09.491584 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-opticalflow.so.535.113.01
I1023 02:24:09.492166 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-opencl.so.535.113.01
I1023 02:24:09.492770 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-nvvm.so.535.113.01
I1023 02:24:09.493375 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-ml.so.535.113.01
I1023 02:24:09.493985 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-glvkspirv.so.535.113.01
I1023 02:24:09.494307 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-glsi.so.535.113.01
I1023 02:24:09.494887 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-glcore.so.535.113.01
I1023 02:24:09.495182 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-fbc.so.535.113.01
I1023 02:24:09.495484 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-encode.so.535.113.01
I1023 02:24:09.496078 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-eglcore.so.535.113.01
I1023 02:24:09.496388 5247 nvc_info.c:176] selecting /usr/lib/libnvidia-allocator.so.535.113.01
I1023 02:24:09.496975 5247 nvc_info.c:176] selecting /usr/lib/libnvcuvid.so.535.113.01
I1023 02:24:09.497526 5247 nvc_info.c:176] selecting /usr/lib/libcuda.so.535.113.01
I1023 02:24:09.498115 5247 nvc_info.c:176] selecting /usr/lib/libGLX_nvidia.so.535.113.01
I1023 02:24:09.498685 5247 nvc_info.c:176] selecting /usr/lib/libGLESv2_nvidia.so.535.113.01
I1023 02:24:09.499039 5247 nvc_info.c:176] selecting /usr/lib/libGLESv1_CM_nvidia.so.535.113.01
I1023 02:24:09.499378 5247 nvc_info.c:176] selecting /usr/lib/libEGL_nvidia.so.535.113.01
W1023 02:24:09.499395 5247 nvc_info.c:402] missing library libnvidia-nscq.so
W1023 02:24:09.499401 5247 nvc_info.c:402] missing library libnvidia-gpucomp.so
W1023 02:24:09.499406 5247 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so
W1023 02:24:09.499412 5247 nvc_info.c:402] missing library libnvidia-compiler.so
W1023 02:24:09.499417 5247 nvc_info.c:402] missing library libnvidia-pkcs11.so
W1023 02:24:09.499423 5247 nvc_info.c:402] missing library libvdpau_nvidia.so
W1023 02:24:09.499428 5247 nvc_info.c:402] missing library libnvidia-ifr.so
W1023 02:24:09.499434 5247 nvc_info.c:402] missing library libnvidia-cbl.so
W1023 02:24:09.499439 5247 nvc_info.c:406] missing compat32 library libnvidia-cfg.so
W1023 02:24:09.499446 5247 nvc_info.c:406] missing compat32 library libnvidia-nscq.so
W1023 02:24:09.499451 5247 nvc_info.c:406] missing compat32 library libcudadebugger.so
W1023 02:24:09.499456 5247 nvc_info.c:406] missing compat32 library libnvidia-gpucomp.so
W1023 02:24:09.499461 5247 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so
W1023 02:24:09.499466 5247 nvc_info.c:406] missing compat32 library libnvidia-compiler.so
W1023 02:24:09.499471 5247 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so
W1023 02:24:09.499475 5247 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so
W1023 02:24:09.499479 5247 nvc_info.c:406] missing compat32 library libnvidia-ngx.so
W1023 02:24:09.499484 5247 nvc_info.c:406] missing compat32 library libvdpau_nvidia.so
W1023 02:24:09.499490 5247 nvc_info.c:406] missing compat32 library libnvidia-ifr.so
W1023 02:24:09.499495 5247 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so
W1023 02:24:09.499501 5247 nvc_info.c:406] missing compat32 library libnvoptix.so
W1023 02:24:09.499506 5247 nvc_info.c:406] missing compat32 library libnvidia-cbl.so
I1023 02:24:09.499991 5247 nvc_info.c:302] selecting /usr/bin/nvidia-smi
I1023 02:24:09.500014 5247 nvc_info.c:302] selecting /usr/bin/nvidia-debugdump
I1023 02:24:09.500036 5247 nvc_info.c:302] selecting /usr/bin/nvidia-persistenced
I1023 02:24:09.500076 5247 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-control
I1023 02:24:09.500099 5247 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-server
W1023 02:24:09.500203 5247 nvc_info.c:428] missing binary nv-fabricmanager
I1023 02:24:09.500259 5247 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.113.01/gsp_ga10x.bin
I1023 02:24:09.500267 5247 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.113.01/gsp_tu10x.bin
I1023 02:24:09.500298 5247 nvc_info.c:561] listing device /dev/nvidiactl
I1023 02:24:09.500303 5247 nvc_info.c:561] listing device /dev/nvidia-uvm
I1023 02:24:09.500309 5247 nvc_info.c:561] listing device /dev/nvidia-uvm-tools
I1023 02:24:09.500315 5247 nvc_info.c:561] listing device /dev/nvidia-modeset
W1023 02:24:09.500393 5247 nvc_info.c:352] missing ipc path /var/run/nvidia-persistenced/socket
W1023 02:24:09.500414 5247 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket
W1023 02:24:09.500453 5247 nvc_info.c:352] missing ipc path /tmp/nvidia-mps
I1023 02:24:09.500457 5247 nvc_info.c:854] requesting device information with ''
I1023 02:24:09.506035 5247 nvc_info.c:745] listing device /dev/nvidia0 (GPU-cf22c558-a271-8c6a-eed9-067e58966643 at 00000000:01:00.0)
NVRM version:   535.113.01
CUDA version:   12.2

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 4090
Brand:          GeForce
GPU UUID:       GPU-cf22c558-a271-8c6a-eed9-067e58966643
Bus Location:   00000000:01:00.0
Architecture:   8.9
I1023 02:24:09.506058 5247 nvc.c:434] shutting down library context
I1023 02:24:09.506082 5250 rpc.c:95] terminating nvcgo rpc service
I1023 02:24:09.506457 5247 rpc.c:135] nvcgo rpc service terminated successfully
I1023 02:24:09.507428 5249 rpc.c:95] terminating driver rpc service
I1023 02:24:09.507548 5247 rpc.c:135] driver rpc service terminated successfully

nvidia-container-cli -V:

cli-version: 1.14.3
lib-version: 1.14.3
build date: 2023-10-19T11:32+0000
build revision: 1eb5a30a6ad0415550a9df632ac8832bf7e2bbba
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

This was a documentation issue: the sample podman command was missing the --device=nvidia.com/gpu=all flag.
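
For anyone landing here later, this is a rough sketch of the original command with the missing flag added. It assumes a CDI spec for the GPU already exists on the host, so that the nvidia.com/gpu=all device name resolves. The gpu_test wrapper and the DRY_RUN variable are hypothetical names used only for illustration, not part of the project's docs. They let you print the command before running it.

```shell
# Sketch of the corrected GPU smoke test from this issue.
# gpu_test and DRY_RUN are hypothetical names, not part of ublue or podman.
gpu_test() {
    # Same arguments as the original report, plus the --device flag
    # that the documentation was missing.
    set -- podman run \
        --user 1000:1000 \
        --security-opt=no-new-privileges \
        --cap-drop=ALL \
        --security-opt label=type:nvidia_container_t \
        --device=nvidia.com/gpu=all \
        docker.io/nvidia/samples:vectoradd-cuda11.2.1

    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only print the assembled command.
        echo "$*"
    else
        # Actually run the container.
        "$@"
    fi
}

# Print the command for inspection; drop DRY_RUN=1 to run it for real.
DRY_RUN=1 gpu_test
```

If the flag is the only problem, running it without DRY_RUN=1 should end with the same "Test PASSED" output that just --unstable nvidia-test-cuda produced above.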