ublue-os/hwe

ublue-nvctk-cdi.service should be refactored to udev rule

Opened this issue · 4 comments

In the Nvidia images, we have the ublue-nvctk-cdi.service to support containers.

The only conditions on this service are that the binary exists and is executable, and that it runs after local-fs.target. This is problematic because it always runs, even when the Nvidia modules are not loaded because no Nvidia card is present. For eGPUs, the Nvidia card is not present until much later in the boot process. Since the script depends on the necessary hardware being present, this should be handled via a udev rule instead of a service. Right now with an eGPU, you have to manually restart the service before entering any containers.

I'll try converting the service to a udev rule to test.

A related concern was reported in Discord ( https://discord.com/channels/1072614816579063828/1072617059265032342/1232829046036103231 ): if the Nvidia GPU has been disabled (for example, the dGPU is disabled in BIOS on a dual-GPU laptop), the service fails erroneously.

I should finally fix this bug.

This will also fail if the Nvidia card isn't "ready". We've seen an internal A4000 throw this error as well.

Hello! I'm the user mentioned by @bsherman.
The system this happened on is running a custom image based on the ublue-kinoite-nvidia image (no Nvidia-related changes applied downstream of ublue, only surface stuff so far). As described, the dGPU is disabled in BIOS when this happens; there is no error in Hybrid mode. This is the systemd log of the failed service:

× ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation
     Loaded: loaded (/usr/lib/systemd/system/ublue-nvctk-cdi.service; enabled; preset: enabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: exit-code) since Thu 2024-04-25 16:49:45 CEST; 2h 27min ago
   Main PID: 5074 (code=exited, status=1/FAILURE)
        CPU: 28ms

Apr 25 16:49:45 fedora systemd[1]: Starting ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation...
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=info msg="Auto-detected mode as \"nvml\""
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_DRIVER_NOT_LOADED"
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Failed with result 'exit-code'.
Apr 25 16:49:45 fedora systemd[1]: Failed to start ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation.

And here is what journalctl -xeu returns for this service:

Apr 25 16:49:45 fedora systemd[1]: Starting ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation...
░░ Subject: A start job for unit ublue-nvctk-cdi.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit ublue-nvctk-cdi.service has begun execution.
░░
░░ The job identifier is 331.
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=info msg="Auto-detected mode as \"nvml\""
Apr 25 16:49:45 fedora nvidia-ctk[5074]: time="2024-04-25T16:49:45+02:00" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_DRIVER_NOT_LOADED"
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit ublue-nvctk-cdi.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Apr 25 16:49:45 fedora systemd[1]: ublue-nvctk-cdi.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit ublue-nvctk-cdi.service has entered the 'failed' state with result 'exit-code'.
Apr 25 16:49:45 fedora systemd[1]: Failed to start ublue-nvctk-cdi.service - ublue nvidia container toolkit CDI auto-generation.
░░ Subject: A start job for unit ublue-nvctk-cdi.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit ublue-nvctk-cdi.service has finished with a failure.
░░
░░ The job identifier is 331 and the job result is failed.

As I mentioned on Discord, I don't think disabling the dGPU should be a source of errors. It has a HUGE impact on battery life, and if I don't plan on doing anything that requires the dGPU, it's best to just disable it until I need it. In this case, displaying a warning at most would be ideal.
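A minimal sketch of that "warn instead of fail" behaviour, assuming the unit called a small wrapper rather than nvidia-ctk directly. The function name `generate_cdi_if_driver_loaded` is hypothetical, and treating the presence of `/sys/module/nvidia` as "the driver is loaded" is an assumption for illustration, not how the ublue service works today. The nvidia-ctk command line is the one quoted elsewhere in this thread.

```shell
#!/bin/sh
# Hypothetical wrapper: skip CDI generation with a warning when the NVIDIA
# kernel module is not loaded, instead of failing the unit.
#
# $1:    directory that exists only while the driver is loaded
#        (assumed to be /sys/module/nvidia; a parameter so it can be tested)
# $2...: command to run when the driver is present
generate_cdi_if_driver_loaded() {
    module_dir=$1
    shift
    if [ ! -d "$module_dir" ]; then
        # Warn on stderr but report success so the unit does not enter 'failed'.
        echo "warning: NVIDIA driver not loaded, skipping CDI generation" >&2
        return 0
    fi
    # Driver is present: run the real command and propagate its exit status.
    "$@"
}

# Example invocation (left commented out so the sketch has no side effects):
# generate_cdi_if_driver_loaded /sys/module/nvidia \
#     /usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

This only covers the disabled-dGPU case. It would not help with an eGPU that shows up after the service has already run, which is what the udev approach is meant to address.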

@m2Giles and I were discussing this, and we can replace the service with a udev rule which calls nvidia-ctk when the device gets added.

ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", RUN{program}="/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"

Something like this?
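One way to sanity-check the match conditions before testing the rule itself is to list the PCI devices whose sysfs attributes would satisfy them. The sketch below is an illustration only: the function name `list_rule_matches` is hypothetical, and it only mirrors the `vendor` and `class` comparisons from the rule above, not udev's actual matching logic.

```shell
#!/bin/sh
# List PCI devices whose sysfs attributes match the rule's conditions:
# vendor 0x10de and class 0x030000.
#
# $1: PCI devices directory (defaults to /sys/bus/pci/devices; a parameter
#     so the function can be pointed at a fake tree for testing)
list_rule_matches() {
    pci_dir=${1:-/sys/bus/pci/devices}
    for dev in "$pci_dir"/*; do
        # Skip entries without readable vendor/class attributes.
        [ -r "$dev/vendor" ] && [ -r "$dev/class" ] || continue
        vendor=$(cat "$dev/vendor")
        class=$(cat "$dev/class")
        if [ "$vendor" = "0x10de" ] && [ "$class" = "0x030000" ]; then
            # Print the PCI address, e.g. 0000:01:00.0
            basename "$dev"
        fi
    done
}

# Usage on a real system:
# list_rule_matches
```

If this prints nothing on a machine with an Nvidia GPU, it may be worth checking the card's `class` attribute: a GPU that reports as a 3D controller (0x030200) rather than a VGA controller (0x030000) would not match the rule as written.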