ipaqmaster/vfio

Single gpu passthrough - Error when unbinding nvidia gpu

mez0ru opened this issue · 7 comments

I first used the script to install Windows 10. After installing all the drivers, I attempted single GPU passthrough. At first the vbios I used did not match my GPU, so I generated one from /sys/bus/pci/devices/0000:01:00.0/rom, which worked without any patching (it was already patched). It gave me a visual clue: my OS logo went away when I used that vbios. But I have tried again and again, even over remote SSH, and the result is always the same: the unbind never completes, the system gets stuck, and I have to reboot. Using tty3, I could see that the unbind command is frozen and is consuming one full CPU core. I don't know what the issue is, but here's my output:
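For reference, dumping the vbios through sysfs usually looks like the sketch below. It assumes root privileges and that the kernel exposes a `rom` attribute for the device; the helper name `dump_rom` and the output file name are only illustrative and are not part of the script.

```shell
#!/usr/bin/env bash
# dump_rom DEVICE_DIR OUTFILE
# Hypothetical helper: copies the PCI option ROM exposed by sysfs to OUTFILE.
dump_rom() {
  local dev=$1 out=$2
  echo 1 > "$dev/rom" || return 1   # enable reads of the rom attribute
  cat "$dev/rom" > "$out"           # copy the ROM image
  local status=$?                   # remember whether the copy succeeded
  echo 0 > "$dev/rom"               # disable reads again
  return "$status"
}

# Example (as root):
#   dump_rom /sys/bus/pci/devices/0000:01:00.0 generated_vbios2.rom
```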

main -image win10_nv_passthr.qcow2 -imageformat qcow2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -vbios generated_vbios2.rom -pci 'NVIDIA' -usb 'HyperX|G203|K120' -run

-pinvcpus       not specified, guest will execute on full host CPU without any pinning: (6) with (6) threads.
-memory         not specified, will use half host total:15986 MB
-image(s)       specified, using virtual disk(s) this run:
                1
                  Path:         win10_nv_passthr.qcow2
                  Format:       qcow2
-romfile        specified, if a GPU is detected in the -pci arguments this romfile will be used.
                generated_vbios2.rom
                Please confirm your romfile is safe with a project such as rom-parser before using this feature
-bridge/-nonet  not specified, VM will be given a NAT adapter
                with a random mac suffix (guest to host) this run.
                OK for most applications.
USB:
  Matched: 046d:c092 'Logitech, Inc. G203 LIGHTSYNC Gaming Mouse'
    Added to USB Args as:       -device usb-host,vendorid=0x046d,productid=0xc092

  Matched: 0951:16a4 'Kingston Technology HyperX 7.1 Audio'
    Added to USB Args as:       -device usb-host,vendorid=0x0951,productid=0x16a4

  Matched: 046d:c31c 'Logitech, Inc. Keyboard K120'
    Added to USB Args as:       -device usb-host,vendorid=0x046d,productid=0xc31c

PCI:
  Matched:      01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
  IOMMU Group:  1
    [INFO] Detected driver nvidia is using this device. It will be re-bound on VM exit.
    Video device 001[10de:1b82] is bound to a driver which isn't vfio-pci and could be in use by the DM, framebuffer or otherwise.
    For this reason the script will now attempt to stop the display-manager service and unbind
    the efi framebuffer instead of risking a driver unbind deadlock in waiting for X to quit.
    If your X server and virtual consoles don't use this card you can unbind it from its driver manually before running this script.
    Stopping display-manager and unbinding console drivers in 5 seconds...
    Unbinding from:     nvidia
main: line 376:  6221 Done                    echo "$fullBuspath"
      6222 Killed                  | sudo timeout --signal 9 5 tee /sys/bus/pci/devices/$fullBuspath/driver/unbind > /dev/null
    The device  0000:01:00.0 // 10de:1b82  Was unable to unbind after 5 seconds, is something else using it?
    (E.g This will happen to a GPU in use by X)
    Giving up.

Cleaning up..
PCI:
  Matched:      01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
  IOMMU Group:  1
    Rebinding 10de:1b82 back to driver: nvidia
tee: '/sys/bus/pci/devices/0000:01:00.0/driver/unbind': No such file or directory
tee: /sys/bus/pci/drivers/vfio-pci/remove_id: No such device

The iommu groups:

IOMMU Group 0 
        00:00.0 Host bridge [0600]: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers [8086:3ec2] (rev 07) 
IOMMU Group 1 
        00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 07) 
        01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1) 
        01:00.1 Audio device [0403]: NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0] (rev a1) 
IOMMU Group 2 
        00:02.0 Display controller [0380]: Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630] [8086:3e92] 
IOMMU Group 3 
        00:08.0 System peripheral [0880]: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model [8086:1911] 
IOMMU Group 4 
        00:14.0 USB controller [0c03]: Intel Corporation 200 Series/Z370 Chipset Family USB 3.0 xHCI Controller [8086:a2af] 
        00:14.2 Signal processing controller [1180]: Intel Corporation 200 Series PCH Thermal Subsystem [8086:a2b1] 
IOMMU Group 5 
        00:16.0 Communication controller [0780]: Intel Corporation 200 Series PCH CSME HECI #1 [8086:a2ba] 
IOMMU Group 6 
        00:17.0 SATA controller [0106]: Intel Corporation 200 Series PCH SATA controller [AHCI mode] [8086:a282] 
IOMMU Group 7 
        00:1c.0 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #1 [8086:a290] (rev f0) 
IOMMU Group 8 
        00:1c.3 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #4 [8086:a293] (rev f0) 
IOMMU Group 9 
        00:1c.6 PCI bridge [0604]: Intel Corporation 200 Series PCH PCI Express Root Port #7 [8086:a296] (rev f0) 
IOMMU Group 10 
        00:1f.0 ISA bridge [0601]: Intel Corporation Z370 Chipset LPC/eSPI Controller [8086:a2c9] 
        00:1f.2 Memory controller [0580]: Intel Corporation 200 Series/Z370 Chipset Family Power Management Controller [8086:a2a1] 
        00:1f.3 Audio device [0403]: Intel Corporation 200 Series PCH HD Audio [8086:a2f0] 
        00:1f.4 SMBus [0c05]: Intel Corporation 200 Series/Z370 Chipset Family SMBus Controller [8086:a2a3] 
IOMMU Group 11 
        03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15) 
IOMMU Group 12 
        04:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6 AX200 [8086:2723] (rev 1a)
uname -a :
Linux fedora 5.17.7-200.fc35.x86_64

Hi! And thanks for giving the script a try.

This is likely the problem:

Stopping display-manager and unbinding console drivers in 5 seconds...
Unbinding from: nvidia
main: line 376: 6221 Done echo "$fullBuspath"
6222 Killed | sudo timeout --signal 9 5 tee /sys/bus/pci/devices/$fullBuspath/driver/unbind > /dev/null
The device 0000:01:00.0 // 10de:1b82 Was unable to unbind after 5 seconds, is something else using it?
(E.g This will happen to a GPU in use by X)
Giving up.

The script tried to unbind your GPU from the nvidia driver, but the unbind command failed to return after 5 seconds of waiting. In the Linux kernel this only happens when something is still using your Nvidia GPU, which prevents it from being unbound from the nvidia driver.
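One way to look for a userspace holder is to scan /proc for open file descriptors that point at the NVIDIA device nodes. This is a rough sketch, assuming the nodes live under /dev/nvidia* and that it runs as root so other users' processes are visible; `holders` is a made-up helper name. An empty result does not prove the GPU is free, because kernel-side users such as a framebuffer console hold no file descriptor.

```shell
#!/usr/bin/env bash
# holders PATH_PREFIX
# Prints "PID COMMAND" for every process that has an open file descriptor
# whose target path starts with PATH_PREFIX.
holders() {
  local prefix=$1 fd target pid
  for fd in /proc/[0-9]*/fd/*; do
    # Descriptors can vanish or be unreadable; skip those.
    target=$(readlink "$fd" 2>/dev/null) || continue
    case $target in
      "$prefix"*)
        pid=${fd#/proc/}
        pid=${pid%%/*}
        printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
        ;;
    esac
  done | sort -u
}

# Default prefix is an assumption about where the NVIDIA nodes live.
holders "${1:-/dev/nvidia}"
```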

I have a few questions to get this moving for you:

  1. Can you please advise what your display manager is? Do you know if it uses a custom service name rather than the generic display-manager?
  2. Are you running any other applications or CLI tools which may be using the GPU as the script tries to unbind the card?
  3. Can you please provide the output of nvidia-smi to see the process list of what's using your card?
  1. I'm using Fedora Workstation 35, which ships with GNOME, but I later installed KDE Plasma 5.24.4 on it. The display manager is GDM.
  2. I don't think so. I tried multiple times after a fresh startup. I also tried logging out of KDE (so that the hard drives stay mounted) and then running the script over remote SSH, with the same result.
  3. This time I stopped the display-manager first, as the script does, to see which processes were still using the nvidia driver. To my surprise, nvidia-smi showed nothing, yet after running the script it was still unable to unbind the GPU.

nvidia-smi:

Sat May 21 12:38:17 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   52C    P8    11W / 180W |      2MiB /  8116MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

And thanks for the help, btw.

It's a difficult one. The unbind process hangs at 100% CPU (not sure why...) when something is locking the device you want to unbind. I haven't seen many threads that cover what to do on Fedora installs and actually reach an answer, though.

Could it possibly be the Nvidia Persistence Daemon (if that is even included and enabled automatically on Fedora 35)? It would probably show up under systemctl status nvidia-persistenced. If present, stopping it may be the ticket to unbinding this GPU.
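Something along these lines would check for the daemon and stop it before the unbind. It is only a sketch: the unit name nvidia-persistenced is the one guessed above and may differ or be absent on Fedora 35, and `stop_if_active` is a hypothetical helper, not something the script provides.

```shell
#!/usr/bin/env bash
# stop_if_active UNIT
# Stops a systemd unit only if it is currently active.
stop_if_active() {
  local unit=$1
  if ! command -v systemctl >/dev/null 2>&1; then
    echo "systemctl not found; skipping $unit"
    return 0
  fi
  if systemctl is-active --quiet "$unit"; then
    echo "stopping $unit"
    sudo systemctl stop "$unit"
  else
    echo "$unit is not active"
  fi
}
```

Running `stop_if_active nvidia-persistenced` right before the script would show whether the daemon was the holder.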

Otherwise I may have to fire up a Fedora install on a usb to try and reproduce this.

Sorry for the late reply. In the end the problem had many layers, unfortunately. First, nvidia was not unbinding because:

$ lsmod | grep -i nvidia
nvidia_drm             69632  11
nvidia_modeset       1204224  25 nvidia_drm
nvidia_uvm           1187840  2
nvidia              35377152  2288 nvidia_uvm,nvidia_modeset
drm_kms_helper        339968  2 nvidia_drm,i915
drm                   622592  21 drm_kms_helper,kvmgt,nvidia,nvidia_drm,i915,ttm

As you can see, around 6 modules were using nvidia, so it was impossible to unbind until all of them were removed.
That was the first clue: once I ran modprobe -r on every module in the list, I could finally unbind successfully.
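A small sketch of how the unload order could be derived from lsmod output like the listing above, so that dependent modules are removed before the ones they use. It only prints the order and does not unload anything. `nvidia_unload_order` is a made-up name, and it only considers modules whose names start with nvidia; a module with a non-zero use count can still refuse to unload.

```shell
#!/usr/bin/env bash
# nvidia_unload_order
# Reads lsmod output on stdin and prints the nvidia* modules in an order
# where each module appears after every nvidia* module that uses it.
nvidia_unload_order() {
  awk '
    $1 ~ /^nvidia/ { name[++n] = $1; users[$1] = $4; loaded[$1] = 1 }
    END {
      left = n
      while (left > 0) {
        progressed = 0
        for (i = 1; i <= n; i++) {
          m = name[i]
          if (!(m in loaded)) continue
          blocked = 0
          k = split(users[m], dep, ",")
          # A module is blocked while one of its users is still listed.
          for (j = 1; j <= k; j++)
            if (dep[j] in loaded) blocked = 1
          if (!blocked) { print m; delete loaded[m]; left--; progressed = 1 }
        }
        if (!progressed) exit 1   # circular dependency, give up
      }
    }'
}

if command -v lsmod >/dev/null 2>&1; then
  lsmod | nvidia_unload_order
fi
```

With the listing above this prints nvidia_drm, nvidia_modeset, nvidia_uvm and nvidia, in that order, which could then be passed to `sudo modprobe -r`.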
Then the real issue appeared: a black screen with no display. I spent the whole day trying to fix it. In the end the problem was caused by the nvidia driver: it's the old Code 43 problem, which affects older drivers (< 460, I think) and has since been fixed. Surprising as it may sound, Windows Update had downloaded a driver from the period when nvidia did not allow passthrough. The workaround was adding a vendor id and kvm hidden so that the driver does not detect that it's running under passthrough.
Now that I have updated the driver, I should finally be able to remove the vendor id and kvm hidden.
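On a plain QEMU command line, the workaround described above usually maps to two `-cpu` properties, `kvm=off` and `hv_vendor_id`. The sketch below only builds the argument string. The helper name `passthrough_cpu_args` and the default vendor string are placeholders, and the 12 character limit is my understanding of the Hyper-V vendor id field rather than something taken from the script.

```shell
#!/usr/bin/env bash
# passthrough_cpu_args [VENDOR_ID]
# Prints a -cpu argument that hides the KVM signature and sets a custom
# Hyper-V vendor id (placeholder default, assumed limit of 12 characters).
passthrough_cpu_args() {
  local vendor=${1:-0123456789ab}
  if [ "${#vendor}" -gt 12 ]; then
    echo "vendor id longer than 12 characters: $vendor" >&2
    return 1
  fi
  printf '%s\n' "-cpu host,kvm=off,hv_vendor_id=$vendor"
}

passthrough_cpu_args "$@"
```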

That's all, I think. One last thing: I got it working by following this guide:
https://gitlab.com/Karuri/vfio
and using virt-manager, so it was an entirely different approach, but from what I can read in your script the two are pretty similar. The difference is which modules get removed with modprobe -r; that's why your script was unable to unbind my GPU. Other than that, you could (optionally) add kvm hidden and a vendor id to work around the Code 43 issue in case it happens.
Your script is great btw. I hope you can fix the issues I mentioned so I can use it instead. Thanks for the help!

Your solution will work. However, the script unbinds hardware dynamically to avoid having to rmmod those 6 kernel modules on every single run, while also leaving them loaded for a potential second graphics card. If that isn't a bother for you in this single-GPU scenario, then that is fine.

You will get a Code 43 if the vbios you've passed through is invalid for your card or not yet patched. It can also happen with other PCI passthrough issues, such as an odd virtual topology.
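As a quick first check on a romfile, a PCI option ROM image is expected to begin with the bytes 55 AA, and dumps that still carry a vendor header in front of the image will not. The sketch below tests only that; `rom_has_signature` is a hypothetical helper, and passing it does not mean the rom is valid for your card. A tool such as rom-parser gives a far more thorough answer.

```shell
#!/usr/bin/env bash
# rom_has_signature FILE
# Succeeds if FILE starts with the option ROM signature bytes 0x55 0xAA.
rom_has_signature() {
  local sig
  # Read the first two bytes and render them as lowercase hex, e.g. "55aa".
  sig=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$sig" = "55aa" ]
}
```

For example, `rom_has_signature generated_vbios2.rom && echo "signature ok"`.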

Also, did you end up checking if the nvidia-persistenced service was running and try stopping that if it was?

I will fire up a Fedora Workstation 35 install some time this week to see if I can find what was locking the unbind request and put a permanent handler into the script.

Hey mez0ru,

Did the above solution of stopping the nvidia-persistenced service work?

Closing this for the time being