/run/nvidia-persistenced/socket: no such device or address
matyro opened this issue · 19 comments
Hi,
we are currently setting up a new cluster deployment environment with slurm, pyxis and enroot.
Our machines have DGX OS installed.
Container images like centos
srun --container-image=centos grep PRETTY /etc/os-release
finish without a problem. GPU-based images like
srun --container-image=nvcr.io/nvidia/tensorflow:22.08-tf2-py3 /bin/bash
fail during startup:
nvidia-container-cli: mount error: file creation failed: /raid/enroot-data/user-9011/pyxis_59.0/run/nvidia-persistenced/socket: no such device or address
[ERROR] /raid/enroot//hooks.d/98-nvidia.sh exited with return code 1
I think it is some misconfiguration, but at the moment I am not able to spot it.
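For context, "no such device or address" is the message for errno ENXIO, which open(2) returns when a path refers to a unix-domain socket but is opened as a regular file. A minimal demonstration of this (using a hypothetical path /tmp/demo.sock, with python3 only to create the socket node):

```shell
# Opening an existing unix-domain socket as a regular file fails with
# ENXIO ("No such device or address") -- the same errno reported by the
# mount code above when it hits /run/nvidia-persistenced/socket.
sock=/tmp/demo.sock
rm -f "$sock"
# create a unix socket node at $sock
python3 -c "import socket; socket.socket(socket.AF_UNIX).bind('$sock')"
cat "$sock" || echo "cat failed as expected"
```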
Hi @matyro. Could you file an issue at https://github.com/NVIDIA/enroot instead as they may have a better idea as to what is happening here. If there is an issue with the NVIDIA Container CLI then please update this issue with the relevant context.
It is unclear where exactly the problem originates.
At the moment we are using a DGX A100 node with no major changes to DGX OS for testing.
Thanks @matyro. As a matter of interest, which version of libnvidia-container-tools is being used?
Hi, not the newest release:
nvidia-container-cli -V
cli-version: 1.7.0
lib-version: 1.7.0
build date: 2021-11-30T19:53+00:00
build revision: f37bb387ad05f6e501069d99e4135a97289faf1f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Updating to the newest version available did not help:
enroot start -e NVIDIA_VISIBLE_DEVICES=all --root --rw cuda_root
nvidia-container-cli: container error: stat failed: /raid/enroot-data/cuda_root/proc/12474: no such file or directory
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
root@ml2ran03:~# nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Enroot itself works if GPUs are deactivated:
enroot start -e NVIDIA_VISIBLE_DEVICES=void --root --rw cuda_root
So the problem must lie somewhere in the interaction between the two.
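The void value works because the enroot NVIDIA hook skips GPU injection entirely, so nvidia-container-cli (and the failing mount) is never invoked. A hypothetical, simplified sketch of that gating (not the actual 98-nvidia.sh code):

```shell
# Sketch of the hook gating: NVIDIA_VISIBLE_DEVICES=void (or unset)
# means "do not inject GPUs", anything else selects devices to inject.
should_inject_gpus() {
  case "${NVIDIA_VISIBLE_DEVICES:-void}" in
    void|"") return 1 ;;   # skip the GPU hook entirely
    *)       return 0 ;;   # "all" or a device list: run nvidia-container-cli
  esac
}
NVIDIA_VISIBLE_DEVICES=void should_inject_gpus && echo inject || echo skip  # skip
NVIDIA_VISIBLE_DEVICES=all  should_inject_gpus && echo inject || echo skip  # inject
```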
@matyro this doesn't show the same error creating /run though, or are the error messages in fact identical?
You are right, it seems /proc was not mounted correctly; it's empty:
root@ml2ran03:/# ll /
total 84
drwxrwxr-x 17 root root 4096 Oct 4 19:29 ./
drwxrwxr-x 17 root root 4096 Oct 4 19:29 ../
-rw-r--r-- 1 root root 0 Oct 6 13:01 .lock
-rw-r--r-- 1 root root 16047 Aug 8 21:08 NGC-DL-CONTAINER-LICENSE
lrwxrwxrwx 1 root root 7 Aug 1 13:22 bin -> usr/bin/
drwxr-xr-x 2 root root 4096 Apr 15 2020 boot/
drwxr-xr-x 2 root root 4096 Oct 5 10:17 dev/
drwxrwxr-x 36 root root 4096 Oct 4 18:57 etc/
drwxr-xr-x 2 root root 4096 Apr 15 2020 home/
lrwxrwxrwx 1 root root 7 Aug 1 13:22 lib -> usr/lib/
lrwxrwxrwx 1 root root 9 Aug 1 13:22 lib32 -> usr/lib32/
lrwxrwxrwx 1 root root 9 Aug 1 13:22 lib64 -> usr/lib64/
lrwxrwxrwx 1 root root 10 Aug 1 13:22 libx32 -> usr/libx32/
drwxr-xr-x 2 root root 4096 Aug 1 13:22 media/
drwxr-xr-x 2 root root 4096 Aug 1 13:22 mnt/
drwxr-xr-x 2 root root 4096 Aug 1 13:22 opt/
drwxr-xr-x 2 root root 4096 Apr 15 2020 proc/
drwx------ 2 root root 4096 Oct 5 17:46 root/
drwxr-xr-x 5 root root 4096 Aug 1 13:25 run/
lrwxrwxrwx 1 root root 8 Aug 1 13:22 sbin -> usr/sbin/
drwxr-xr-x 2 root root 4096 Aug 1 13:22 srv/
drwxr-xr-x 3 root root 4096 Oct 5 10:17 sys/
drwxrwxrwt 2 root root 4096 Aug 1 13:25 tmp/
drwxr-xr-x 13 root root 4096 Aug 1 13:22 usr/
drwxr-xr-x 11 root root 4096 Aug 1 13:25 var/
The error depends on the installed config and/or version. With the default config files and an up-to-date version, the /proc error appears first.
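A quick, generic way to check the empty-/proc symptom from inside a container (not enroot-specific):

```shell
# Inside a correctly started container, /proc is a mounted procfs and
# /proc/self exists; an empty directory means the mount never happened.
if [ -d /proc/self ]; then
  echo "procfs mounted"
else
  echo "/proc empty: procfs not mounted"
fi
```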
Hi @matyro, with regards to the error related to /run/nvidia-persistenced/socket: after seeing similar behaviour in NVIDIA/nvidia-docker#1690, I was able to reproduce this locally (by including the -v /var/run:/var/run flag when starting a docker container). I have updated the mount code in libnvidia-container to check whether a file exists before creating it in https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/187.
Are you in a position to test this patch?
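The idea behind the change can be illustrated with a simplified sketch (the actual libnvidia-container code is C; this only shows the check-before-create logic, with a hypothetical path):

```shell
# Simplified illustration of the fix: before creating a mount-point
# file, check whether something already exists at the path (e.g. a
# unix socket bind-mounted in from the host) and leave it alone.
ensure_mount_point() {
  path=$1
  if [ -e "$path" ]; then
    echo "exists: $path (left untouched)"
  else
    touch "$path"
    echo "created: $path"
  fi
}
ensure_mount_point /tmp/demo-mnt-point   # first call creates the file
ensure_mount_point /tmp/demo-mnt-point   # second call leaves it alone
```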
We are currently having a hardware problem on one of our nodes and are in contact with enterprise support. When the hardware exchange is done next week we will have a spare node available and should be able to run some tests.
The machine is back, what is the simplest way to test it now?
I already cloned the repository and switched the branch to your MR.
Running make -f mk/docker.mk ubuntu18.04-amd64 from the libnvidia-container repo root should generate Debian packages in ./dist/ubuntu18.04/amd64 (replace this with your distro of choice). You can then install the deb files directly with:
sudo dpkg -i dist/ubuntu18.04/amd64/libnvidia-container1_1.12.0~rc.1-1_amd64.deb dist/ubuntu18.04/amd64/libnvidia-container-tools_1.12.0~rc.2-1_amd64.deb
This will make the patched version available system-wide. You can confirm the commit used to build the nvidia-container-cli by running:
nvidia-container-cli --version
A (forced) reinstall of the libnvidia-container1 and libnvidia-container-tools packages from our public repos should restore the original state.
Hi,
I deployed the fixed version today on our test machine and enroot is working fine.
GPUs are limited to the selected device and the container starts without an error message.
root@ml2ran01:~# NVIDIA_VISIBLE_DEVICES=1 enroot start --root --rw cuda10.2-U18.04 nvidia-smi
Thu Oct 27 08:55:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:0F:00.0 Off | 0 |
| N/A 28C P0 50W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@ml2ran01:~#
Only SLURM is still a bit problematic, but this should have nothing to do with libnvidia-container:
root@gwkilab:~# srun --gres=gpu:1 --container-image=nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04 nvidia-smi
pyxis: importing docker image: nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04
pyxis: imported docker image: nvcr.io/nvidia/cuda:10.2-devel-ubuntu18.04
Thu Oct 27 08:49:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:07:00.0 Off | 0 |
| N/A 28C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:0F:00.0 Off | 0 |
| N/A 28C P0 51W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:47:00.0 Off | 0 |
| N/A 28C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4E:00.0 Off | 0 |
| N/A 29C P0 54W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:87:00.0 Off | 0 |
| N/A 34C P0 57W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:90:00.0 Off | 0 |
| N/A 31C P0 55W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:B7:00.0 Off | 0 |
| N/A 31C P0 51W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:BD:00.0 Off | 0 |
| N/A 32C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Thanks for the confirmation @matyro, that's good news.
I'm not sure how srun interacts with the NVIDIA container stack to provide the isolation. Do you have a link to their documentation on this?
https://slurm.schedmd.com/gres.html#GPU_Management
In the Slurm job, CUDA_VISIBLE_DEVICES is set correctly. But even directly on the node,
CUDA_VISIBLE_DEVICES=0 nvidia-smi
shows all GPUs.
Since nvidia-smi uses the low-level NVIDIA Management Library under the hood, I don't think it checks CUDA_VISIBLE_DEVICES to filter the list. This seems like an issue with how enroot is being triggered (through pyxis) when invoking srun.
For clarification, the NVIDIA container stack uses the NVIDIA_VISIBLE_DEVICES environment variable in the container to determine which devices to inject. This is processed specifically by the NVIDIA Container Runtime Hook and is (as far as I am aware) not applicable to enroot directly.
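To make the distinction concrete: nvidia-smi asks NVML, which always enumerates every physical GPU, while only the CUDA runtime filters by CUDA_VISIBLE_DEVICES. A toy simulation of that filtering (the device list and function are illustrative, not real APIs):

```shell
# nvml_list stands in for what NVML (and hence nvidia-smi) reports;
# cuda_visible mimics how a CUDA app narrows it by CUDA_VISIBLE_DEVICES.
nvml_list="GPU0 GPU1 GPU2 GPU3"
cuda_visible() {
  for idx in $(printf '%s' "${CUDA_VISIBLE_DEVICES:-}" | tr ',' ' '); do
    set -- $nvml_list
    shift "$idx" && echo "$1"    # pick the idx-th device from the list
  done
}
CUDA_VISIBLE_DEVICES=0 cuda_visible   # a CUDA app would see only GPU0
echo "$nvml_list"                     # nvidia-smi still lists all GPUs
```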
Everything works now, and I think the problem can be closed from my side.
The only remaining question is whether there is an estimated date for when the patched libnvidia-container will be available directly in DGX OS.
Thanks for your help
Dominik
Thanks @matyro. Our current timeline is to release the final version including this fix at the start of December. It should be picked up for distribution through the DGX repos shortly after that.
Note that we will release an rc.2 with these changes between now and then, and it will be installable through our public experimental repositories.