ipaqmaster/vfio

PCI binding/unbinding fails and hangs, despite running on integrated graphics

redeven opened this issue · 6 comments

Attempting to pass through the GPU (Nvidia), the process fails and throws an error. Furthermore, the clean-up process hangs the first time it's run.
Xorg is using the iGPU (AMD), the system is set to Integrated graphics only (through optimus-manager), and no processes show in nvidia-smi (not even Xorg).

[screenshots: script output showing the passthrough error and the cleanup hang]

The issue seems to happen for any PCI device. Reproduced by removing the GPU from the -pci argument; it still fails for the SSD.

[screenshot: the same failure with only the SSD in -pci]

The script seems to be working again for me since commit d32539a. I was able to pass the AX200 wireless chip on my motherboard through to a guest and back to my host just fine.

I suspect what you're experiencing now is a deadlock caused by the back-and-forth seen in your screenshots. This commonly happens to some PCI devices and motherboards after a few rounds of unbinding/rebinding or module unloading/reloading (which also implies unbinding/rebinding). Have you tried rebooting? When PCI devices hang on rebinding attempts, a reboot usually does it for me.
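For context, here's roughly what one of those unbind/rebind cycles looks like at the sysfs level (a minimal sketch; the 0000:01:00.0 address is only an example, substitute your GPU's from lspci -D). The final bind step is where a deadlocked device will hang indefinitely:

```sh
# Example PCI address only; find yours with: lspci -D | grep -i nvidia
dev=0000:01:00.0

# Unbind the device from whatever driver currently owns it (e.g. nvidia)
echo "$dev" | sudo tee /sys/bus/pci/devices/$dev/driver/unbind

# Prefer vfio-pci for this device, load the module, then bind it
echo vfio-pci | sudo tee /sys/bus/pci/devices/$dev/driver_override
sudo modprobe vfio-pci
echo "$dev" | sudo tee /sys/bus/pci/drivers/vfio-pci/bind
```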

If a reboot doesn't do it, you'll need to figure out what else could be using the PCI device you're trying to pass through.
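If you're not sure what's holding it, a few standard checks can narrow it down (the 01:00.0 slot and the device nodes below are examples for an Nvidia card):

```sh
# Which kernel driver currently owns the device
lspci -nnk -s 01:00.0

# Any processes holding the Nvidia device nodes open
sudo lsof /dev/nvidia* 2>/dev/null

# Anything using the DRM card/render nodes (display servers show up here)
sudo fuser -v /dev/dri/*
```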

It also appears you're using -usb without any valid arguments? If you continue to have problems you should share the full ./main command you're running from this project 🙂

Looking more closely at the output, did you mean to pass both NVMe devices to your guest? It seems to have matched both your WD Blue SN570 and the Samsung one. If you're booted from either of those, that would also prevent you from giving both to the guest.
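If in doubt, it's easy to confirm which drive the running system actually lives on before handing the others to the guest, for example:

```sh
# The block device backing the root filesystem
findmnt -n -o SOURCE /

# All drives with model names and mountpoints, to match against the above
lsblk -o NAME,MODEL,MOUNTPOINT
```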

A reboot doesn't fix the issue (in fact, it reproduces the cleanup hang, which only happens on the first run). I'm unsure how to track down what else could be using the GPU device, given that nvidia-smi showed nothing of use.

-usb has a second keyboard+mouse pair that I had unplugged; I reattached them for this test.

I did mean to pass both NVMe devices to -pci; those are my older SSDs holding my Windows C: and D: drives, and they aren't mounted. The running Linux system is on a separate NVMe.

Here's a screenshot from a test run on a freshly rebooted system that booted into Integrated graphics mode by default (as opposed to starting in Hybrid and switching to Integrated after a logout/login). It's about as clean as I can make it.

[screenshot: test run on the freshly rebooted system]

> -usb has a second keyboard+mouse pair that I had unplugged; I reattached them for this test.

Ah I see.

It actually seems you're getting a different error in that latest screenshot.

The script tried to unbind your Nvidia GPU and something is preventing that. This is typically due to something else using it, such as a display server.

Do you have a display server running with that PCI GPU included? You can usually check this with lsof /dev/dri/by-path/*, which will help confirm it (regardless of X11 conf settings, this can still happen).

Otherwise, check whether the nvidia persistence daemon is running; it can also cause the problem you're seeing: systemctl status nvidia-persistenced
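If the daemon does turn out to be holding the GPU, stopping it (and unloading the driver stack) before running the script usually frees the card; a quick sketch:

```sh
# Stop the persistence daemon so it releases the GPU
sudo systemctl stop nvidia-persistenced

# Unload the Nvidia modules (dependents first; this fails if anything still uses the GPU)
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
```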

Honestly, I have no idea what the actual issue is. I've now fully given up on VM gaming on this setup; I'll revisit in 5 years if I get an AMD GPU.

Closing the issue since it's confirmed to be unrelated to this script.