NVIDIA/nvidia-container-runtime

Support for RHEL8.4 (ppc64le)

mgiessing opened this issue · 11 comments

Hi, is there an estimated date for RHEL 8.4 support?

Thanks!

Hi @mgiessing. Do you mean from a packaging perspective? Have you tried to use the RHEL8.3 (or centos8) packages?

Yes, I mean from both a packaging and a functional perspective. If I replace $distribution (which is rhel8.4) with rhel8.3 here:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
sudo yum install -y nvidia-container-runtime-hook
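With the substitution described above applied, the commands might look like the following sketch. The variable name `repo_url` is introduced here for illustration only; the URL pattern and package name are taken from the commands above. The privileged steps are left commented out because they need root and network access.

```shell
# Sketch only: hard-code the distribution instead of reading /etc/os-release,
# so a rhel8.4 host fetches the rhel8.3 repository definition.
distribution=rhel8.3

# repo_url is an illustrative variable name, not part of the original commands.
repo_url="https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo"
echo "$repo_url"

# To apply it, uncomment the two lines below:
# curl -s -L "$repo_url" | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
# sudo yum install -y nvidia-container-runtime-hook
```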

I can install the runtime hook, but encounter this error:

[root@p630-met1 ~]# docker run --rm docker.io/nvidia/cuda-ppc64le:11.3.1-runtime-ubi8 nvidia-smi
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
Error: OCI runtime error: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig terminated with signal 4

Using a RHEL8.3 (bare-metal) distribution works fine.

Here is some further information about CUDA, the driver, and the system:

[root@p630-met1 ~]# uname -a
Linux p630-met1 4.18.0-305.10.2.el8_4.ppc64le #1 SMP Mon Jul 12 04:35:57 EDT 2021 ppc64le ppc64le ppc64le GNU/Linux

[root@p630-met1 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.4 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.4"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.4 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8.4:GA"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.4"

[root@p630-met1 ~]# nvidia-smi
Tue Jul 27 12:22:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   28C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Tesla V1...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   32C    P0    40W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Tesla V1...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   28C    P0    37W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA Tesla V1...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Thanks!

I've hit the same issue. It's related to the changes in RHEL8.4. The NVIDIA toolkit stack installs correctly; however, we receive a SIGILL (the "signal 4" in the error above) when attempting to start a container. I have straces etc. if anyone wants to take a look. I'll try to recompile the toolkit on 8.4 itself in case there's a library or linking issue.
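For readers matching "terminated with signal 4" to SIGILL, here is a minimal sketch of decoding a signal number from a shell exit status. The helper name `signal_from_status` is hypothetical; it relies only on the shell convention that an exit status above 128 means the process was killed by signal (status - 128), and on the standard `kill -l` lookup.

```shell
# Hypothetical helper: map a shell exit status to the name of the signal
# that terminated the process, or "none" if it exited normally.
signal_from_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        kill -l "$((status - 128))"
    else
        echo "none"
    fi
}

# 132 = 128 + 4, the status a shell reports for a process killed by signal 4
signal_from_status 132
```

One way to use this would be to run `/sbin/ldconfig -p` directly on the affected host and pass `$?` to the helper, assuming the crash reproduces outside the container hook.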

@mgiessing Looks like NVIDIA/libnvidia-container#143 will address this issue

I've hit the same issue too, but I'm using CentOS8.4 with CUDA 11.4.

Looks like they finally merged the fix a few hours ago. Not sure why there was a delay, but hopefully it'll be in the upcoming release of libnvidia-container.

Hi @dllehr81 and @mgiessing

We have published libnvidia-container 1.5.1~rc.1 with this change to our experimental repositories. Let us know if this addresses the problems that you are seeing. We expect to promote this to stable in the near future.

Thanks Evan! We appreciate it! I built a one-off libnvidia with the proposed solution and didn't have a problem. I'll try your rc.1 and see how it looks!

The full libnvidia-container 1.5.1 release is now out as well.
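To confirm that a host has picked up a release containing the fix, a version comparison along these lines may help. This is a sketch under assumptions: the helper `version_at_least` is hypothetical, it depends on GNU `sort -V`, and the commented `rpm` query assumes the library is packaged as `libnvidia-container1`, which should be checked on the target system.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort) option.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Assumption: on an RPM-based host the installed version could be read with
# installed=$(rpm -q --qf '%{VERSION}' libnvidia-container1)
installed=1.5.1   # placeholder value for illustration

if version_at_least "$installed" 1.5.1; then
    echo "libnvidia-container $installed includes the 1.5.1 release"
else
    echo "libnvidia-container $installed is older than 1.5.1"
fi
```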

@mgiessing / @dllehr81 have you been able to test the new releases? We have also added symlinks to centos8 for rhel8.4 so that this can be accessed without manually specifying the distribution as centos8 or rhel8.3.

Please close this issue if the error has been resolved.

Sorry, I missed this one. As mentioned by Doug, the issue has been resolved with NVIDIA/libnvidia-container#143, and the symlinks for RHEL8.4 work now as well. Thanks!