NVIDIA/libnvidia-container

Building libnvidia-container 1.14.5 builds 1.14.4

ich777 opened this issue · 19 comments

A user on the Unraid Forums reported that the latest driver package is not working properly with Docker. After trying it myself, I was able to reproduce this error.
I started investigating and building the libraries manually.

When I checkout v1.14.5 I get this output:

HEAD is now at 870d7c5d Merge branch 'cherry-pick-1.14.4' into 'release-1.14'

When I continue the build process these libraries get built:

...
usr/lib/debug/usr/lib/libnvidia-container.so.1
usr/lib/libnvidia-container-go.so.1.14.4
usr/lib/libnvidia-container.so.1
usr/lib/libnvidia-container-go.so.1
usr/lib/pkgconfig/
usr/lib/pkgconfig/libnvidia-container.pc
usr/lib/libnvidia-container-go.so
usr/lib/libnvidia-container.so.1.14.4
...

When building nvidia-container-toolkit, it properly builds version 1.14.5.

I assume that the driver is not working properly with Docker when using libnvidia-container 1.14.4 and nvidia-container-toolkit 1.14.5.

Did I do something wrong or is this an oversight?

Cheers,
Christoph

We updated the version logic to use git describe --tags to extract the version information. The issue here is that the v1.14.4 and v1.14.5 tags point to the same commit, so git describe returns v1.14.4. In order to override the version, you can set the LIB_VERSION and LIB_TAG make variables.
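
For reference, a quick way to check which version string a checkout will produce (purely a sanity check on the clone; the output depends on how the tags resolve):

git checkout v1.14.5
git describe --tags    # whatever this prints is the version the build will derive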

This is the logic that we use when building this as part of the NVIDIA Container Toolkit repo here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/0409824106214a55df4c89a41f12c48f492cd51b/scripts/build-all-components.sh#L58-L64
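
As a rough sketch of such an override (assuming LIB_TAG only carries a pre-release suffix such as rc.1 and can be left empty for a final release; adjust to your own make targets):

# Override the git-describe-derived version explicitly when invoking make
make LIB_VERSION=1.14.5 LIB_TAG=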

@elezar thanks for the information, I will look into that and into how I can implement it in my build toolchain.

Anyway, I have to investigate a bit further, since even with version 1.14.4 and driver version 550.54.14 I can't utilize my T400 in Docker containers.

@elezar did something else change too? I now get this error when trying to create a container with v1.14.5:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 2: unknown.

EDIT: This was caused by compiling with Go 1.22.0. After switching back to Go 1.20.13 everything is working; however, I'm still not able to use driver version 550.54.14 with Docker.
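
For reference, a quick way to confirm which Go toolchain the build will pick up before compiling (standard tooling, nothing project-specific):

which go       # make sure the intended toolchain is first in PATH
go version     # should report go1.20.x (e.g. go1.20.13) for a working build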

I'll close this issue since this is resolved.

@ich777 please open an issue against the nvidia-container-toolkit repo with the error messages you're seeing with the 550 driver.

@elezar I already created an issue on the NVIDIA Developer Forums here, since I don't think that it's related to either libnvidia-container or nvidia-container-toolkit.
This only happens with driver 550.54.14 and not with 550.40.07 and earlier.

Does nvidia-smi work in the container, or is it applications that are failing?

/cc @klueska

Does nvidia-smi work in the container, or is it applications that are failing?

Yes, nvidia-smi is working just fine in the container.

There was a new feature included in the 550.54.14 driver that requires additional support in the NVIDIA Container Toolkit.

We are working to release a version that includes this support but are waiting for some driver components to be published.

For now, could you confirm whether adding --device /dev/nvidia-caps-imex-channels/channel0 to your container allows it to function?
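
For reference, a minimal invocation of the kind suggested here might look like this (image and device selection are purely illustrative):

docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  --device /dev/nvidia-caps-imex-channels/channel0 \
  ubuntu:22.04 nvidia-smi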

For now, could you confirm whether adding --device /dev/nvidia-caps-imex-channels/channel0 to your container allows it to function?

That does not work; I only have these devices:

root@Test:~# ls -la /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 Feb 26 15:07 /dev/nvidia-modeset
crw-rw-rw- 1 root root 240,   0 Feb 26 15:08 /dev/nvidia-uvm
crw-rw-rw- 1 root root 240,   1 Feb 26 15:08 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Feb 26 15:07 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Feb 26 15:07 /dev/nvidiactl

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root root     80 Feb 26 15:08 ./
drwxr-xr-x 17 root root   3380 Feb 26 15:08 ../
cr--------  1 root root 244, 1 Feb 26 15:08 nvidia-cap1
cr--r--r--  1 root root 244, 2 Feb 26 15:08 nvidia-cap2

This is the output of nvidia-smi from inside the container (of course without the device path that you've mentioned):

root@8a1b7fbf37e8:/# nvidia-smi
Mon Feb 26 15:10:41 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T400                    Off |   00000000:01:00.0 Off |                  N/A |
| 36%   38C    P0             N/A /   31W |       0MiB /   2048MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@8a1b7fbf37e8:/# 

What was your original error? You don't seem to include it in the description (unless I missed it somehow).

What was your original error? You don't seem to include it in the description (unless I missed it somehow).

Some users on the Unraid Forums reported that transcoding (NVENC) was not working with Plex as well as with Jellyfin.

I was able to reproduce this on my test machine. I made a short post on the NVIDIA Developer Forums here, where I described what I've tested so far.

If you need any logs or anything else just let me know.

Yeah, I'm trying to understand what "does not work" means.

Yeah, I'm trying to understand what "does not work" means.

Sorry, I just realized that I didn't provide much information...

Transcoding is not working with driver version 550.54.14. It fails on Jellyfin with this error:

...
frame=    1 fps=0.0 q=0.0 size=N/A time=00:00:00.00 bitrate=N/A speed=   0x    
[h264_nvenc @ 0x561522fa7240] Failed locking bitstream buffer: invalid param (8): 
Error submitting video frame to the encoder
[libfdk_aac @ 0x561522fa4400] 2 frames left in the queue on closing
Conversion failed!

and on Plex with this error:

...
Feb 26, 2024 15:29:24.384 [23302936435512] Fehlersuche — Jobs: '/usr/lib/plexmediaserver/Plex Transcoder' exit code for process 1386 is -9 (signal: Killed)
...

Sorry, I will try to find more useful information in the Plex log, but it basically falls back to software transcoding.

Following up on @elezar's suggestion of adding --device /dev/nvidia-caps-imex-channels/channel0, can you try running the following on the host (which should create /dev/nvidia-caps-imex-channels/channel0) and then test again:

nvidia-modprobe -i 0:1

I'm hoping this isn't an issue, but I want to rule it out.
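
For reference, one way to verify on the host that the command actually creates the channel device before re-testing the container:

nvidia-modprobe -i 0:1
ls -la /dev/nvidia-caps-imex-channels/    # channel0 should now be listed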

After running:

nvidia-modprobe -i 0:1

I can confirm that the device:

/dev/nvidia-caps-imex-channels/channel0

was created, but sadly it's still the same when passing this device through.
The Jellyfin log gives me this:

...
frame=    1 fps=0.0 q=0.0 size=N/A time=00:00:00.00 bitrate=N/A speed=   0x    
[h264_nvenc @ 0x55634ff34c40] Failed locking bitstream buffer: invalid param (8): 
Error submitting video frame to the encoder
[libfdk_aac @ 0x55634ff2fa40] 2 frames left in the queue on closing
Conversion failed!

And Plex falls back to software transcoding.

Here is the docker run output:

docker run
  -d
  --name='Jellyfin'
  --net='bridge'
  -e TZ="Europe/Berlin"
  -e HOST_OS="Unraid"
  -e HOST_HOSTNAME="Test"
  -e HOST_CONTAINERNAME="Jellyfin"
  -e 'NVIDIA_VISIBLE_DEVICES'='GPU-09e16239-57bc-2ca8-39ca-c72ed08bac48'
  -e 'NVIDIA_DRIVER_CAPABILITIES'='all'
  -e 'PUID'='99'
  -e 'PGID'='100'
  -l net.unraid.docker.managed=dockerman
  -l net.unraid.docker.webui='http://[IP]:[PORT:8096]/'
  -l net.unraid.docker.icon='https://raw.githubusercontent.com/ich777/docker-templates/master/ich777/images/jellyfin.png'
  -p '8096:8096/tcp'
  -p '8920:8020/tcp'
  -v '/mnt/user/Filme':'/mnt/movies':'ro'
  -v '/mnt/user/Serien':'/mnt/tv':'ro'
  -v '/mnt/cache/appdata/jellyfin/cache':'/cache':'rw'
  -v '/mnt/cache/appdata/jellyfin':'/config':'rw'
  --device='/dev/nvidia-caps-imex-channels/channel0'
  --group-add=18
  --runtime=nvidia 'jellyfin/jellyfin' 

Well from our perspective that is "good news", as it means that it doesn't appear to be an issue with the container toolkit, but rather something else.

Well from our perspective that is "good news", as it means that it doesn't appear to be an issue with the container toolkit, but rather something else.

Should I open an issue somewhere with the information from here, just to keep track of it and see if someone else has a similar issue, or should I wait until someone answers on the Developer Forums?
Do you think it is worth testing the open-source kernel module?

Just to let you know, @elezar and @klueska: the driver version that was released today, v550.67, solves this issue and NVENC is working again in combination with Docker.