NVIDIA/libnvidia-container

Building libnvidia-container 1.14.5 builds 1.14.4

ich777 opened this issue · 19 comments

A user on the Unraid Forums reported that the latest driver package is not working properly with Docker. After trying it myself, I was able to reproduce this error.
I started investigating and building the libraries manually.

When I checkout v1.14.5 I get this output:

HEAD is now at 870d7c5d Merge branch 'cherry-pick-1.14.4' into 'release-1.14'

When I continue the build process these libraries get built:

...
usr/lib/debug/usr/lib/libnvidia-container.so.1
usr/lib/libnvidia-container-go.so.1.14.4
usr/lib/libnvidia-container.so.1
usr/lib/libnvidia-container-go.so.1
usr/lib/pkgconfig/
usr/lib/pkgconfig/libnvidia-container.pc
usr/lib/libnvidia-container-go.so
usr/lib/libnvidia-container.so.1.14.4
...

When building nvidia-container-toolkit, it properly builds version 1.14.5.

I assume that the driver is not working properly with Docker when using libnvidia-container 1.14.4 and nvidia-container-toolkit 1.14.5.

Did I do something wrong or is this an oversight?

Cheers,
Christoph

We updated the version logic to use git describe --tags to extract the version information. The issue here is that the v1.14.4 and v1.14.5 tags point to the same commit, so git describe returns v1.14.4. In order to override the version, you can set the LIB_VERSION and LIB_TAG make variables.
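
For reference, a quick way to check which version string a checkout will produce (purely a sanity check on the clone; the output depends on how the tags resolve):

git checkout v1.14.5
git describe --tags    # whatever this prints is the version the build will derive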

This is the logic that we use when building this as part of the NVIDIA Container Toolkit repo here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/0409824106214a55df4c89a41f12c48f492cd51b/scripts/build-all-components.sh#L58-L64
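
As a rough sketch of such an override (assuming LIB_TAG only carries a pre-release suffix such as rc.1 and can be left empty for a final release; adjust to your own make targets):

# Override the git-describe-derived version explicitly when invoking make
make LIB_VERSION=1.14.5 LIB_TAG=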

@elezar thanks for the information, I will look into that and into how I can implement it in my build toolchain.

Anyway, I have to investigate a bit further, since even with version 1.14.4 and driver version 550.54.14 I can't utilize my T400 in Docker containers.

@elezar did something else change too? I now get this error when trying to create a container with v1.14.5:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 2: unknown.

EDIT: This was caused by compiling with Go 1.22.0. After switching back to Go 1.20.13 everything is working; however, I'm still not able to use driver version 550.54.14 with Docker.
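
For reference, a quick way to confirm which Go toolchain the build will pick up before compiling (standard tooling, nothing project-specific):

which go       # make sure the intended toolchain is first in PATH
go version     # should report go1.20.x (e.g. go1.20.13) for a working build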

I'll close this issue since this is resolved.

@ich777 please open an issue against the nvidia-container-toolkit repo with the error messages you're seeing with the 550 driver.

@elezar I already created an issue on the NVIDIA Developer Forums here, since I don't think that it's related to either libnvidia-container or nvidia-container-toolkit.
This only happens with driver 550.54.14 and not with 550.40.07 and earlier.

Does nvidia-smi work in the container, or is it applications that are failing?

/cc @klueska

Does nvidia-smi work in the container, or is it applications that are failing?

Yes, nvidia-smi is working just fine in the container.

There was a new feature included in the 550.54.14 driver that requires additional support in the NVIDIA Container Toolkit.

We are working to release a version that includes this support but are waiting for some driver components to be published.

For now, could you confirm whether adding --device /dev/nvidia-caps-imex-channels/channel0 to your container allows it to function?
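
For reference, a minimal invocation of the kind suggested here might look like this (image and device selection are purely illustrative):

docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  --device /dev/nvidia-caps-imex-channels/channel0 \
  ubuntu:22.04 nvidia-smi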

For now, could you confirm whether adding --device /dev/nvidia-caps-imex-channels/channel0 to your container allows it to function?

That does not work; I only have these devices:

root@Test:~# ls -la /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 Feb 26 15:07 /dev/nvidia-modeset
crw-rw-rw- 1 root root 240,   0 Feb 26 15:08 /dev/nvidia-uvm
crw-rw-rw- 1 root root 240,   1 Feb 26 15:08 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Feb 26 15:07 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Feb 26 15:07 /dev/nvidiactl

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root root     80 Feb 26 15:08 ./
drwxr-xr-x 17 root root   3380 Feb 26 15:08 ../
cr--------  1 root root 244, 1 Feb 26 15:08 nvidia-cap1
cr--r--r--  1 root root 244, 2 Feb 26 15:08 nvidia-cap2

This is the output of nvidia-smi from inside the container (of course without the device path that you've mentioned):

root@8a1b7fbf37e8:/# nvidia-smi
Mon Feb 26 15:10:41 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T400                    Off |   00000000:01:00.0 Off |                  N/A |
| 36%   38C    P0             N/A /   31W |       0MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@8a1b7fbf37e8:/# 

What was your original error? You don't seem to include it in the description (unless I missed it somehow).

What was your original error? You don't seem to include it in the description (unless I missed it somehow).

Some users on the Unraid Forums reported that transcoding (NVENC) was not working with Plex as well as with Jellyfin.

I was able to reproduce this on my test machine. I made a short post on the NVIDIA Developer Forums here, where I described what I've tested so far.

If you need any logs or anything else just let me know.

Yeah, I'm trying to understand what "does not work" means.

Yeah, I'm trying to understand what "does not work" means.

Sorry, I just realized that I didn't provide much information...

Transcoding is not working with driver version 550.54.14. It fails on Jellyfin with this error:

...
frame=    1 fps=0.0 q=0.0 size=N/A time=00:00:00.00 bitrate=N/A speed=   0x    
[h264_nvenc @ 0x561522fa7240] Failed locking bitstream buffer: invalid param (8): 
Error submitting video frame to the encoder
[libfdk_aac @ 0x561522fa4400] 2 frames left in the queue on closing
Conversion failed!

and on Plex with this error:

...
Feb 26, 2024 15:29:24.384 [23302936435512] Fehlersuche — Jobs: '/usr/lib/plexmediaserver/Plex Transcoder' exit code for process 1386 is -9 (signal: Killed)
...

Sorry, I will try to find more useful information in the Plex log, but it basically falls back to software transcoding.

Following up on @elezar's suggestion of adding --device /dev/nvidia-caps-imex-channels/channel0, can you try running the following on the host (which should create /dev/nvidia-caps-imex-channels/channel0) and then test again:

nvidia-modprobe -i 0:1

I'm hoping this isn't an issue, but I want to rule it out.
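
For reference, one way to verify on the host that the command actually creates the channel device before re-testing the container:

nvidia-modprobe -i 0:1
ls -la /dev/nvidia-caps-imex-channels/    # channel0 should now be listed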

After running:

nvidia-modprobe -i 0:1

I can confirm that the device:

/dev/nvidia-caps-imex-channels/channel0

was created, but sadly it's still the same when passing this device through.
The Jellyfin log gives me this:

...
frame=    1 fps=0.0 q=0.0 size=N/A time=00:00:00.00 bitrate=N/A speed=   0x    
[h264_nvenc @ 0x55634ff34c40] Failed locking bitstream buffer: invalid param (8): 
Error submitting video frame to the encoder
[libfdk_aac @ 0x55634ff2fa40] 2 frames left in the queue on closing
Conversion failed!

And Plex falls back to software transcoding.

Here is the docker run output:

docker run
  -d
  --name='Jellyfin'
  --net='bridge'
  -e TZ="Europe/Berlin"
  -e HOST_OS="Unraid"
  -e HOST_HOSTNAME="Test"
  -e HOST_CONTAINERNAME="Jellyfin"
  -e 'NVIDIA_VISIBLE_DEVICES'='GPU-09e16239-57bc-2ca8-39ca-c72ed08bac48'
  -e 'NVIDIA_DRIVER_CAPABILITIES'='all'
  -e 'PUID'='99'
  -e 'PGID'='100'
  -l net.unraid.docker.managed=dockerman
  -l net.unraid.docker.webui='http://[IP]:[PORT:8096]/'
  -l net.unraid.docker.icon='https://raw.githubusercontent.com/ich777/docker-templates/master/ich777/images/jellyfin.png'
  -p '8096:8096/tcp'
  -p '8920:8020/tcp'
  -v '/mnt/user/Filme':'/mnt/movies':'ro'
  -v '/mnt/user/Serien':'/mnt/tv':'ro'
  -v '/mnt/cache/appdata/jellyfin/cache':'/cache':'rw'
  -v '/mnt/cache/appdata/jellyfin':'/config':'rw'
  --device='/dev/nvidia-caps-imex-channels/channel0'
  --group-add=18
  --runtime=nvidia 'jellyfin/jellyfin' 

Well from our perspective that is "good news", as it means that it doesn't appear to be an issue with the container toolkit, but rather something else.

Well from our perspective that is "good news", as it means that it doesn't appear to be an issue with the container toolkit, but rather something else.

Should I open an issue somewhere with the information from here, just to keep track of it and see if someone else has a similar issue, or should I wait until someone answers on the Developer Forums?
Do you think it is worth testing the open-source kernel module?

Just to let you know, @elezar and @klueska: the driver version that was released today, v550.67, solves this issue and NVENC is working again in combination with Docker.