AMD Single GPU Passthrough hangs up and never finishes
shekhars-li opened this issue · 16 comments
I have an AMD CPU + single GPU (AMD 5600 XT). I already installed Win10 and verified it runs fine without PCI passthrough. I then ran the following:
sudo ./main -mem 8G -image /var/lib/libvirt/images/win10.qcow2 -imageformat qcow2 -bridge tap0,enp34s0 -pci 'Radeon|USB|HDMI Audio' -ignorevtcon -run -bios /usr/share/OVMF/OVMF_CODE_4M.fd -vbios /usr/share/vgabios/Sapphire.RX5600XT.6144.200314.rom -killx
-ignoreVtconn specified, efi-framebuffer/vtcon bindings will be left as is.
AMD cards don't mind vtcons; this argument is to work around a recent
NULL pointer dereference bug in fbcon.c on NVIDIA-powered hosts
Follow the bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=216475
-pinvcpus not specified, Guest will get half host's core total as vcpus (No pinning): 3 hyperthreaded vcpu's (6/2) for a total of 6 vcpu threads (12/2).
-memory specified, guest will receive: 8192 MB
-image(s) specified, using virtual disk(s) this run:
Driver: virtio-blk-pci
1
Path: /var/lib/libvirt/images/win10.qcow2
Format: qcow2
-romfile specified, if a GPU is detected in the -pci arguments this romfile will be used.
/usr/share/vgabios/Sapphire.RX5600XT.6144.200314.rom
Please confirm your romfile is safe with a project such as rom-parser before using this feature
Host int not specified, will attach VM tap to existing bridge
enp34s0 exists and is up, will attach tap0 to that.
ioctl(TUNSETIFF): Device or resource busy
RTNETLINK answers: Operation not supported
------------------
Bridge details:
enp34s0:
Bridge already existed, not running dhclient -r on it.
------------------
-bridge specified, VM will be bridged to the host with a tap adapter.
PCI:
vfio-pci isn't loaded. Loading it now.
Matched: 03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01)
IOMMU Group: 17
[INFO] Detected driver xhci_hcd is using this device. It will be re-bound on VM exit.
Adding ID and binding to: vfio-pci
Matched: 28:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ca)
IOMMU Group: 20
[INFO] Detected driver amdgpu is using this device. It will be re-bound on VM exit.
Unbinding GPU from: amdgpu...
It appears Xorg has latched onto this GPU, cannot unbind from driver and give to guest without killing Xorg.
Stopping display-manager and unbinding console drivers...
PID TTY STAT TIME COMMAND
1 ? Ss 0:02 /sbin/init splash
959 ? Ss 0:00 /lib/systemd/systemd-logind
19816 tty2 Sl+ 0:01 /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
19939 ? Ssl 0:03 /usr/bin/gnome-shell
./main: line 477: 25000 Done echo "$fullBuspath"
25001 Killed | sudo timeout --signal 9 5 tee /sys/bus/pci/devices/$fullBuspath/driver/unbind > /dev/null 2>&1
Failed... Trying again with X killed...
This GPU is free.
Adding ID and binding to: vfio-pci
./main: line 479: 25047 Done echo "0x$vendor 0x$class"
25048 Killed | sudo timeout --signal 9 5 tee /sys/bus/pci/drivers/vfio-pci/new_id > /dev/null 2>&1
The device 0000:28:00.0 // 1002:731f Was unable to bind via new_id after 5 seconds, is something else using it?
(E.g This will happen to a GPU in use by X)
Giving up.
Cleaning up..
We only used tap0 on an existing bridge this run, removing tap0.
tap0 removed.
PCI:
vfio-pci isn't loaded. Loading it now.
Matched: 03:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01)
IOMMU Group: 17
Rebinding 1022:43d5 back to driver: xhci_hcd
Successfully rebound.
Matched: 28:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ca)
IOMMU Group: 20
Rebinding 1002:731f back to driver: amdgpu
This never returns. I checked sudo lsof | grep amdgpu; this is the output:
amdgpu_dm 338 root cwd DIR 259,2 4096 2 /
amdgpu_dm 338 root rtd DIR 259,2 4096 2 /
amdgpu_dm 338 root txt unknown /proc/338/exe
amdgpu_dm 339 root cwd DIR 259,2 4096 2 /
amdgpu_dm 339 root rtd DIR 259,2 4096 2 /
amdgpu_dm 339 root txt unknown /proc/339/exe
amdgpu_dm 340 root cwd DIR 259,2 4096 2 /
amdgpu_dm 340 root rtd DIR 259,2 4096 2 /
amdgpu_dm 340 root txt unknown /proc/340/exe
amdgpu_dm 341 root cwd DIR 259,2 4096 2 /
amdgpu_dm 341 root rtd DIR 259,2 4096 2 /
amdgpu_dm 341 root txt unknown /proc/341/exe
tee 25134 root 3w REG 0,22 4096 36209 /sys/bus/pci/drivers/amdgpu/bind
lspci -k during this returns:
28:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev ca)
Subsystem: Sapphire Technology Limited Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
Kernel modules: amdgpu
28:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
I also manually tried modprobe -r amdgpu after killing gdm. That never works either. What am I doing wrong?
Hey there
Looking at the middle of that script run, it failed to detach the graphics card from the amdgpu driver from the very beginning, likely due to X. It then tried to kill X as permitted with -killx, thinking that might do the trick, but even after the systemctl stop display-manager command was issued, L:464 was still able to see Xorg, which may have still been the cause. Regardless, something is continuing to use the graphics card and preventing you from freeing it up.
Unfortunately I don't have any AMD graphics cards to test this with, but the moment I end up with one I'll make sure the script knows everything about them for unbinding purposes. The hang at the end, while not intended, doesn't influence your actual problem. If anything, I can look into adding timeouts to those cleanup rebinding attempts for this scenario so you at least get your shell back.
If you can make this happen again: I'm not sure whether AMD cards list themselves under /dev/dri, but you could certainly try checking sudo lsof /dev/dri/* to make sure nothing pops up. If anything does pop up when running that command, you've found your culprit.
You should also check whether systemctl status display-manager is even a real service on your machine. If it is not, your X server will have to be killed a different way.
Please let me know how you go with the above two command checks.
Hi @ipaqmaster, thanks a lot for responding, and thanks for creating this script!
Here's the output of sudo lsof /dev/dri/*:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
systemd 1 root 141u CHR 226,0 0t0 408 /dev/dri/card0
systemd-l 880 root 54u CHR 226,0 0t0 408 /dev/dri/card0
Xorg 2262 shekhars mem CHR 226,0 408 /dev/dri/card0
Xorg 2262 shekhars 12u CHR 226,0 0t0 408 /dev/dri/card0
Xorg 2262 shekhars 13u CHR 226,0 0t0 408 /dev/dri/card0
Xorg 2262 shekhars 14u CHR 226,0 0t0 408 /dev/dri/card0
Xorg 2262 shekhars 15u CHR 226,0 0t0 408 /dev/dri/card0
Xorg 2262 shekhars 16u CHR 226,0 0t0 408 /dev/dri/card0
Xorg 2262 shekhars 17u CHR 226,0 0t0 408 /dev/dri/card0
Xorg 2262 shekhars 18u CHR 226,0 0t0 408 /dev/dri/card0
gnome-she 2410 shekhars mem CHR 226,128 407 /dev/dri/renderD128
gnome-she 2410 shekhars 11u CHR 226,128 0t0 407 /dev/dri/renderD128
gnome-she 2410 shekhars 12u CHR 226,128 0t0 407 /dev/dri/renderD128
gnome-she 2410 shekhars 13u CHR 226,128 0t0 407 /dev/dri/renderD128
gnome-she 2410 shekhars 14u CHR 226,128 0t0 407 /dev/dri/renderD128
I have tried everything, and at one point it (randomly) worked when I was using a start/revert script. I changed some params and it hasn't worked since. I like your approach a lot and it makes sense to me, so I'm trying to make this work. Anyway, I always start this script after killing the display manager and verifying with sudo lsof | grep amdgpu that amdgpu isn't in use anywhere, which doesn't seem to be the case. Any other ideas I can try?
Thanks!
I killed gdm (I am logged in via ssh).
(base) shekhars@shekhars-desktop:~$ sudo systemctl stop display-manager
(base) shekhars@shekhars-desktop:~$ systemctl status display-manager
● gdm.service - GNOME Display Manager
Loaded: loaded (/lib/systemd/system/gdm.service; static; vendor preset: enabled)
Active: inactive (dead) since Sun 2024-01-14 23:10:12 PST; 36min ago
Process: 8712 ExecStartPre=/usr/share/gdm/generate-config (code=exited, status=0/SUCCESS)
Process: 8714 ExecStartPre=/usr/lib/gdm3/gdm-wait-for-drm (code=exited, status=0/SUCCESS)
Process: 8715 ExecStart=/usr/sbin/gdm3 (code=exited, status=0/SUCCESS)
Main PID: 8715 (code=exited, status=0/SUCCESS)
Jan 14 23:10:03 shekhars-desktop systemd[1]: Starting GNOME Display Manager...
Jan 14 23:10:03 shekhars-desktop systemd[1]: Started GNOME Display Manager.
Jan 14 23:10:03 shekhars-desktop gdm-launch-environment][8719]: pam_unix(gdm-launch-environment:session): session opened for user gdm by (uid=0)
Jan 14 23:10:12 shekhars-desktop systemd[1]: Stopping GNOME Display Manager...
Jan 14 23:10:12 shekhars-desktop systemd[1]: gdm.service: Succeeded.
Jan 14 23:10:12 shekhars-desktop systemd[1]: Stopped GNOME Display Manager.
(base) shekhars@shekhars-desktop:~$ sudo lsof | grep amdgpu
amdgpu_dm 337 root cwd DIR 259,2 4096 2 /
amdgpu_dm 337 root rtd DIR 259,2 4096 2 /
amdgpu_dm 337 root txt unknown /proc/337/exe
amdgpu_dm 338 root cwd DIR 259,2 4096 2 /
amdgpu_dm 338 root rtd DIR 259,2 4096 2 /
amdgpu_dm 338 root txt unknown /proc/338/exe
amdgpu_dm 339 root cwd DIR 259,2 4096 2 /
amdgpu_dm 339 root rtd DIR 259,2 4096 2 /
amdgpu_dm 339 root txt unknown /proc/339/exe
amdgpu_dm 340 root cwd DIR 259,2 4096 2 /
amdgpu_dm 340 root rtd DIR 259,2 4096 2 /
amdgpu_dm 340 root txt unknown /proc/340/exe
(base) shekhars@shekhars-desktop:~$ sudo lsof /dev/dri/*
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/125/gvfs
Output information may be incomplete.
(base) shekhars@shekhars-desktop:~$
Seems to me amdgpu should be free to be unloaded?
One final thing to add here: I followed your instructions on reddit (where I found this repo) to just try to unbind and rebind the GPU manually. The unbind works, but the bind to vfio-pci gets stuck:
echo 0000:28:00.0 > /sys/bus/pci/drivers/amdgpu/unbind ---> done
echo 1002 731f > /sys/bus/pci/drivers/vfio-pci/new_id ---> stuck, does not return
I do not see anything noteworthy in dmesg or syslog.
Hmm, that's interesting that the unbind works just fine but adding the ID to vfio-pci (which then quietly binds itself to the card) hangs. Your lsof output does seem to show the GPU is not in use, or at least that nothing in that directory is in use, which may or may not include your GPU. You may want to ls /dev/dri/* just to be certain the check isn't matching nothing.
I'm not sure what that sudo lsof | grep amdgpu command is supposed to be showing, but it does imply there are processes still interacting with amdgpu_dm.
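One variable worth eliminating here is the new_id path itself. Since kernel 3.16 every PCI device exposes a driver_override sysfs attribute that binds a single device to vfio-pci without registering its vendor:device ID globally. A minimal sketch, assuming the PCI address from this thread; the SYSFS variable is purely an illustration aid so the logic can be dry-run against a scratch directory, and on a real host (where it defaults to /sys) the writes need root:

```shell
#!/bin/sh
# Bind one PCI device to vfio-pci via driver_override instead of new_id.
# SYSFS defaults to the real sysfs; point it at a scratch tree to dry-run.
SYSFS="${SYSFS:-/sys}"

bind_vfio() {
    dev="$1"                                # e.g. 0000:28:00.0
    node="$SYSFS/bus/pci/devices/$dev"
    if [ ! -e "$node" ]; then
        echo "device $dev not found" >&2
        return 1
    fi
    # Release the device from whatever driver currently holds it
    [ -e "$node/driver" ] && echo "$dev" > "$node/driver/unbind"
    # Restrict which driver may claim this device, then trigger a reprobe
    echo vfio-pci > "$node/driver_override"
    echo "$dev" > "$SYSFS/bus/pci/drivers_probe"
}

# Usage on a real host (as root):  bind_vfio 0000:28:00.0
```

To hand the device back afterwards, clear driver_override (write an empty string into it) and reprobe the same way.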
Some more questions, sorry:
- Does anything appear in dmesg after letting the command hang for a few minutes, long enough for kernel calls to start timing out?
- Does your AMD GPU there on 0000:28 have any other subdevices which may also need to be unbound? lspci -D | grep 0000:28: should reveal any other sub-components of the graphics PCI device.
- What distro version and kernel version are you running there?
- What consumer motherboard or full server hardware product are you running this on?
- It seems you specified -ignoreVtconn in the script run. It's possible the amdgpu driver is modesetting and the efi framebuffer could be holding on to the card. Could you try unbinding and rebinding the card again, but after running echo 0 > /sys/class/vtconsole/vtcon0/bind ; echo 0 > /sys/class/vtconsole/vtcon1/bind ; echo "vesa-framebuffer.0" > /sys/bus/platform/drivers/vesa-framebuffer/unbind ? This will stop the virtual consoles from drawing, so you may need to SSH in from another machine to run these first.
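The vtcon/framebuffer release step can also be sketched as a small guarded script. Note the assumptions: which generic framebuffer platform device exists (efi-framebuffer.0, vesa-framebuffer.0, or simple-framebuffer.0) varies by host and boot mode, so this probes for each, and the SYSFS variable is only there so the logic can be dry-run against a scratch directory:

```shell
#!/bin/sh
# Release the virtual consoles and any generic framebuffer holding the GPU.
# SYSFS defaults to the real sysfs; point it at a scratch tree to dry-run.
SYSFS="${SYSFS:-/sys}"

release_consoles() {
    # Stop every virtual console from drawing to the framebuffer
    for vtcon in "$SYSFS"/class/vtconsole/vtcon*; do
        [ -e "$vtcon/bind" ] && echo 0 > "$vtcon/bind"
    done
    # Unbind whichever generic framebuffer platform device exists, if any
    for fb in efi-framebuffer vesa-framebuffer simple-framebuffer; do
        drv="$SYSFS/bus/platform/drivers/$fb"
        if [ -e "$drv/$fb.0" ]; then
            echo "$fb.0" > "$drv/unbind"
        fi
    done
}

# Usage on a real host (as root):  release_consoles
```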
If I can find a cheap one online I'll consider buying a second-hand AMD GPU to test your single GPU passthrough scenario and try to reproduce the problem.
@ipaqmaster
I partly solved it. Your script seems to be fine; it might just be my environment. I threw the kitchen sink at it:
- Updated Ubuntu 20.10 -> 22.04
- Updated the kernel from 5.15 to 6.7.0
- Uninstalled amdgpu and reinstalled it.
That seems to have (kind of) solved some problems.
What works:
- I can ssh in, kill gdm, start the script, and Windows starts with the PCI/USB devices passed through. It seems amdgpu can unbind now.
- I can shut down my machine cleanly via ssh.
What doesn't work:
- amdgpu cannot rebind on shutdown. snd_hda_intel (my sound card) rebinds just fine, but amdgpu refuses to. Not sure why.
- I can't run the script directly. It hangs with a black screen. I assume it can't kill gdm properly. ssh works for now.
- Windows does not see any network or graphics card; it loads the basic display adapter. It may be because I need to connect to the internet and download drivers?
Here's the cleanup part of the run:
Cleaning up..
We only used tap0 on an existing bridge this run, removing tap0.
tap0 removed.
PCI:
Matched: 28:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev ca)
IOMMU Group: 20
Rebinding 1002:731f back to driver: amdgpu
Was unable to rebind it to amdgpu.
Matched: 28:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
IOMMU Group: 21
Rebinding 1002:ab38 back to driver: snd_hda_intel
Successfully rebound.
Cleanup complete.
Thank you so much for responding again and looking into this weird problem. I can at least confirm your script works perfectly given the right conditions. Kernel version 5.15 definitely is a problem (as I read somewhere in some reddit thread as well).
To your questions:
Does anything appear in dmesg after letting the command hang for a few minutes for kernel calls to start timing out?
I see this (possibly relevant) on the attempt to rebind: [ 414.818151] amdgpu: probe of 0000:28:00.0 failed with error -22
Does your AMD gpu there on 0000:28 have any other subdevices which may need to also be unbound? lspci -D |grep 0000:28: should reveal any other sub-components of the graphics pci device.
No. The GPU is alone in its group (so is the sound card):
0000:28:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev ca)
0000:28:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio
What distro version and kernel version are you running there?
Was on 20.10, now on 22.04. The kernel was 5.15 earlier, now 6.7.0-060700-generic.
What consumer motherboard or full server hardware product are you running this on?
Consumer board - MSI Tomahawk 450.
It seems you specified -ignoreVtconn in the script run. It's possible the amdgpu driver is modesetting and the efi framebuffer could be holding on to the card. Could you try unbinding and rebinding the card again but after echo 0 > /sys/class/vtconsole/vtcon0/bind ; echo 0 > /sys/class/vtconsole/vtcon1/bind ; echo "vesa-framebuffer.0" > /sys/bus/platform/drivers/vesa-framebuffer/unbind ? This will stop the virtual consoles from drawing so you may need to SSH in from another machine to run these first.
Let me try this. I just did a reset and my screen is blank again.
Yes, when the usual unbind commands fail on their own it's indicative of some funky environment problem, though I'm always looking for gotchas the script can give a heads-up about. I'm glad the upgrade seems to have helped a little bit.
I can't run the script directly. It hangs up with black screen.
This could be a result of unbinding the virtual consoles and their framebuffers. SSH is the best way to debug vfio GPU stuff on a single-GPU host if you need to read the output.
Windows does not see any network or graphics card. It loads basic display adapter. It may be because I need to connect to internet and download drivers?
Yes, the guest needs AMD drivers to use its AMD GPU 🙂 but if it continues to give you problems after installing the drivers in the guest, that can be looked into.
I'm not sure why it's missing its network card. Perhaps the VirtIO drivers are not installed? You can try the script arguments -avoidVirtio and possibly also -nvme (though since it was able to boot already, you may not need the nvme argument).
[ 414.818151] amdgpu: probe of 0000:28:00.0 failed with error -22
I should have noticed this earlier, but your AMD card is most definitely impacted by the reset bug, leaving it unable to reset itself for re-initialization by the host (or a guest, for that matter). It would be worth installing gnif's vendor-reset to see if that problem goes away for you.
If you already have your distro's build tools installed (and dkms + git), then this quick one-liner will fetch, compile and install it for you to try: cd ; git clone https://github.com/gnif/vendor-reset ; cd vendor-reset ; sudo dkms install .
This would definitely be a good idea for me to add and warn about in the script for when an AMD card with the reset problem is detected without vendor-reset installed.
- You were right about the virtio drivers. I thought I had already installed them during setup; I was wrong. Networking works OOB now.
- I had already installed vendor-reset (and can see it's initialized). I will debug this a bit more and share any findings, in case you'd like to add checks for AMD GPUs in the future.
- I still get the basic display adapter and don't see my GPU, even though I can see it being detached, attached to vfio, and so on. The AMD drivers refuse to install as they can't detect the GPU. I don't have an iGPU, only the single discrete one, so the fact that the display works at all means the card is being attached just fine. I will have to dig in more as to why Windows doesn't detect my GPU and show better graphics.
Thanks again for your responses and for creating this beautiful script. :)
No worries at all
Sorry, I didn't realize I hit the comment+close button with that last reply. If you still need to bounce things off me, feel free to re-open the issue.
It would be worth checking that the card appears under Device Manager in the VM, and noting any potential error codes it may have thrown after being passed through. They may hint at something else to tweak.
@ipaqmaster No worries. The core of the issue is resolved now. I can reliably boot up and shut down the VM with the GPU handoffs and reset working just fine. The only problem that remains is an error in Windows for my GPU: "Windows has stopped this device (code 43)". I have tried everything and this one seems to not go away. If you have any ideas I can try, please let me know.
If you have any ideas on dealing with the error Windows throws for the GPU, please let me know. :) @ipaqmaster
I found a couple of reddit posts (for AMD devices of the same series as mine) whose authors solved the problem by passing a root PCI device like this:
-device pcie-root-port,bus=pci.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
-device vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on \
-device vfio-pci,host=01:00.1,bus=root.1,addr=00.1 \
Since the script takes care of binding and unbinding (and quite reliably), I don't want to use qemu directly. How can I go about doing this with the script?
Code 43 is an annoying one for this series of GPU. Some have fixed it with only the vendor-reset solution, also making sure it loads early by adding it to their host's initramfs module list. Others have had luck removing x-vga=on, which can be patched out of this script with sed -i 's/x-vga=on,//g' ./main. And other times it just suddenly works.
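On the loading-early point: assuming a systemd host on kernel 5.15 or newer (which exposes a per-device reset_method sysfs attribute, where device_specific is the method vendor-reset registers), selecting vendor-reset for the card can be sketched as below. The SYSFS and GPU variables are purely illustrative so the logic can be dry-run; adjust the PCI address to your own card:

```shell
#!/bin/sh
# Select vendor-reset's device-specific reset method for the GPU.
# Requires kernel >= 5.15 for the reset_method sysfs attribute.
# SYSFS defaults to the real sysfs; point it at a scratch tree to dry-run.
SYSFS="${SYSFS:-/sys}"
GPU="${GPU:-0000:28:00.0}"   # the card from this thread; adjust as needed

set_reset_method() {
    attr="$SYSFS/bus/pci/devices/$GPU/reset_method"
    if [ ! -e "$attr" ]; then
        echo "no reset_method attribute for $GPU (kernel < 5.15?)" >&2
        return 1
    fi
    echo device_specific > "$attr"
}

# To load vendor-reset at every boot on a systemd host:
#   echo vendor-reset | sudo tee /etc/modules-load.d/vendor-reset.conf
```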
It may also be worth hitting that PCI rom file you've got there (/usr/share/vgabios/Sapphire.RX5600XT.6144.200314.rom) with https://github.com/awilliam/rom-parser, making sure you have appropriately truncated it if needed, and checking whether you need one at all. Typically NVIDIA cards are the ones that truncate their own PCI rom, making them initializable only once per boot, not AMD cards as far as I know.
But any GPU will throw a Code 43 if you use an unpatched or wrong bios rom for the card. It would be worth trying without specifying -romfile at all.
Otherwise there's no harm trying with libvirt to see if the virtual PCIe multifunction root port layout does the trick. I'm not in a position to get a version of that into the script right this minute, but may be able to later.
When it throws Code 43 in the guest, you should also check the host's dmesg log to make sure vendor-reset did its thing.
Not sure there's much else I can help with here. This issue seems to be related to the local setup rather than the script. If you're still working on this and have any further updates I would be happy to keep looking into it with you as far as we can.