joeknock90/Single-GPU-Passthrough

DE keeps resetting after stopping guest

Latrolage opened this issue · 15 comments

So I have single GPU passthrough working fine except for one thing. When I shut down the Windows VM, I log back into Linux (GNOME), but every few seconds (roughly 3 seconds, not 100% consistent) the desktop resets or something. The cursor teleports to the bottom left, all the GNOME extensions reload, and sometimes the windows I have open flicker a bit.

Here is my start script

#!/bin/bash
# Helpful to read output when debugging
set -x

# Stop display manager
systemctl stop display-manager.service

# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind
#echo 0 > /sys/class/vtconsole/vtcon1/bind

# Unbind EFI-Framebuffer
echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/unbind

# Avoid a race condition by waiting a few seconds. This can be calibrated to be shorter or longer if required for your system
sleep 3

modprobe -r nvidia_drm
modprobe -r nvidia_modeset
modprobe -r nvidia
modprobe -r drm
modprobe -r nvidia-uvm

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_26_00_0
virsh nodedev-detach pci_0000_26_00_1
virsh nodedev-detach pci_0000_26_00_2
virsh nodedev-detach pci_0000_26_00_3

# Load VFIO Kernel Module  
modprobe vfio
modprobe vfio_pci
modprobe vfio_iommu_type1

And here is my end script

#!/bin/bash
set -x

modprobe -r vfio
modprobe -r vfio_pci
modprobe -r vfio_iommu_type1
sleep 5
# Re-Bind GPU to Nvidia Driver
virsh nodedev-reattach pci_0000_26_00_0
virsh nodedev-reattach pci_0000_26_00_1
virsh nodedev-reattach pci_0000_26_00_2
virsh nodedev-reattach pci_0000_26_00_3
sleep 5
# Reload nvidia modules
modprobe nvidia_drm
modprobe nvidia_modeset
modprobe nvidia
modprobe drm
modprobe nvidia-uvm

sleep 1

# Rebind VT consoles
echo 1 > /sys/class/vtconsole/vtcon0/bind
sleep 5
# Some machines might have more than 1 virtual console. Add a line for each corresponding VTConsole
#echo 1 > /sys/class/vtconsole/vtcon1/bind

nvidia-xconfig --query-gpu-info > /dev/null 2>&1

sleep 5
echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

sleep 5

# Restart Display Manager
systemctl start display-manager.service

Here is the journalctl output: https://pastebin.ubuntu.com/p/GfgHsVsSCp/

I've updated the start and stop scripts recently. Try working with those.

Using

#!/bin/bash
# Helpful to read output when debugging
set -x

# Stop display manager
systemctl stop display-manager.service
## Uncomment the following line if you use GDM
killall gdm-x-session

# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind

# Unbind EFI-Framebuffer
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

# Avoid a Race condition by waiting 2 seconds. This can be calibrated to be shorter or longer if required for your system
sleep 2

# Unbind the GPU from display driver
virsh nodedev-detach pci_0000_26_00_0
virsh nodedev-detach pci_0000_26_00_1
virsh nodedev-detach pci_0000_26_00_2
virsh nodedev-detach pci_0000_26_00_3

# Load VFIO Kernel Module
modprobe vfio-pci

Makes the screen go black and flicker between black and the cryptsetup screen for full disk encryption. (For some reason this screen is always in the background, behind the desktop environment. I'm not sure why; I'm on Pop!_OS.)

Basically it doesn't work

Did you follow any of the troubleshooting steps (like SSHing in and testing the start script manually)?

Let me know if you get any results there.

It stops on detaching the first PCI target.
If I modprobe -r the nvidia modules before the script attempts to detach the GPU, it works and boots into the Windows VM. But my initial problem persists.
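
The workaround described above (unloading the nvidia modules before the detach step) could be sketched roughly as follows. This is only an illustration, not the repository's script: the function name `unload_nvidia_modules` and the `MODPROBE` and `PROC_MODULES` override variables are hypothetical and exist only to allow a dry run.

```shell
#!/bin/bash
# Sketch only: unload the NVIDIA modules before detaching the GPU.
# MODPROBE and PROC_MODULES are hypothetical overrides for dry runs.
MODPROBE="${MODPROBE:-modprobe}"
PROC_MODULES="${PROC_MODULES:-/proc/modules}"

unload_nvidia_modules() {
    local mod
    # Dependents first: nvidia_drm uses nvidia_modeset, and both
    # nvidia_modeset and nvidia_uvm use nvidia.
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # /proc/modules lists one loaded module per line, name first.
        if grep -q "^${mod} " "$PROC_MODULES"; then
            $MODPROBE -r "$mod" || { echo "failed to unload $mod" >&2; return 1; }
        fi
    done
}
```

Calling `unload_nvidia_modules` just before the first `virsh nodedev-detach` line would stop the script early if a module is still in use, instead of hanging on the detach.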

Very interesting.

This might be something that is required for some and not others. I'm going to have to research this further.

Can you find out what version of you are running on Pop!?

Pop!_OS 20.04 with XanMod kernel 5.11 and NVIDIA driver 460.67

Sorry somehow I missed a word! What version of Libvirt are you running?

$ libvirtd --version
libvirtd (libvirt) 6.0.0

In terms of my initial problem, I've found that whenever the DE resets, I get a new error-level line in dmesg:

[80801.853446] NVRM: Xid (PCI:0000:26:00): 8, pid=163895, Channel 00000018
[80809.915574] NVRM: Xid (PCI:0000:26:00): 8, pid=164714, Channel 00000020
[80817.979682] NVRM: Xid (PCI:0000:26:00): 8, pid=164714, Channel 00000020
[80826.043813] NVRM: Xid (PCI:0000:26:00): 8, pid=164714, Channel 00000020
[80834.107958] NVRM: Xid (PCI:0000:26:00): 8, pid=164714, Channel 00000020

The problem is really weird. On KDE, the compositor instantly crashes whenever I try to start it. Restarting GDM causes other graphical glitches, and the desktop is unusable until reboot.
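
To see how regularly the Xid lines quoted above arrive, the gaps between consecutive events can be computed from the dmesg timestamps. A minimal sketch, assuming the log format shown in this thread; the function name `xid_intervals` is made up for illustration.

```shell
#!/bin/bash
# Sketch only: print the Xid code and the time gap between consecutive
# "NVRM: Xid" lines read from stdin (for example from dmesg).
xid_intervals() {
    LC_ALL=C awk '
    /NVRM: Xid/ {
        # Timestamp looks like "[80801.853446]"; strip the brackets.
        if (!match($0, /\[ *[0-9]+\.[0-9]+\]/)) next
        ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
        # Xid code sits between "): " and the following comma.
        if (!match($0, /\): [0-9]+,/)) next
        code = substr($0, RSTART + 3, RLENGTH - 4)
        if (seen) printf "xid=%s gap=%.1fs\n", code, ts - prev
        prev = ts; seen = 1
    }'
}
```

Running `dmesg | xid_intervals` (root may be required to read the kernel log) prints one line per event after the first, which makes it easier to compare the error timing with the moments the desktop resets.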

Did you ever get this resolved? I'm having the exact same issue also with GDM.

Hey @Latrolage, @shabaduu, have you guys managed to fix the issue? I'm seeing the exact same thing in dmesg, and my DE keeps resetting as well: https://imgur.com/Mv7JjfL

I think the problem happens after running modprobe -r on the nvidia drivers for me. I tested without these lines: the resetting issue goes away, but then the graphics card can't be detached.

I never managed to fix it. I spent around half a year trying to get single GPU passthrough working seamlessly but ended up giving up, since I didn't manage to solve this problem.

I have seen a topic with the exact same problem: ilayna/Single-GPU-passthrough-amd-nvidia#2, where some said they had some success with the v460 driver.

I will try compiling 460 again to see how it goes.

Hey @Latrolage @shabaduu, I have managed to fix the problem.

The trouble was caused by nvidia-xconfig --query-gpu-info > /dev/null 2>&1 in the teardown script.
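
For anyone who wants to try the same change without deleting anything, the nvidia-xconfig line can be commented out in place. This is a rough sketch assuming GNU or BSD sed; the function name `disable_xconfig_query` and the path in the usage note are hypothetical, and the thread does not establish that this line is the cause on every system.

```shell
#!/bin/bash
# Sketch only: comment out the nvidia-xconfig query line in a teardown
# script, keeping a ".bak" copy of the original next to it.
disable_xconfig_query() {
    local script="$1"
    # Only lines that start with the command are touched, so a line that is
    # already commented out is left alone.
    sed -i.bak '/^[[:space:]]*nvidia-xconfig --query-gpu-info/ s/^/# /' "$script"
}
```

For example, `disable_xconfig_query /path/to/teardown.sh` (path is a placeholder) leaves the rest of the script untouched, and the change can be undone by restoring the `.bak` file.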

This is my final script that works for me. Note that I also removed the virsh calls because it works without them.

startup

#!/bin/bash
# Helpful to read output when debugging
set -x

#Stop display manager
systemctl stop gdm3.service

# Unbind VTconsoles
echo 0 > /sys/class/vtconsole/vtcon0/bind


# Unbind EFI-Framebuffer
echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/unbind

# Avoid a race condition by waiting. This can be calibrated to be shorter or longer if required for your system
sleep 10

# unload nvidia
modprobe -r nvidia_drm
modprobe -r nvidia_uvm
modprobe -r nvidia_modeset
modprobe -r nvidia

# Load VFIO Kernel Module  
modprobe vfio
modprobe vfio-pci
modprobe vfio_iommu_type1

teardown

#!/bin/bash

set -x

modprobe -r vfio_pci
modprobe -r vfio_iommu_type1
modprobe -r vfio

echo "efi-framebuffer.0" > /sys/bus/platform/drivers/efi-framebuffer/bind

# Reload nvidia modules

modprobe nvidia
modprobe nvidia_modeset
modprobe nvidia_uvm
modprobe nvidia_drm

# Restart Display Manager
systemctl start gdm3.service

# Rebind VT consoles
echo 1 > /sys/class/vtconsole/vtcon0/bind
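
When debugging scripts like the ones above, it can help to check which kernel driver currently owns the GPU before and after each hook runs. A minimal sketch that reads the `driver` symlink sysfs exposes for each PCI device; the function name `bound_driver` and the `SYSFS_PCI` override are hypothetical, and the address in the usage note is taken from the `pci_0000_26_00_0` device used earlier in this thread.

```shell
#!/bin/bash
# Sketch only: report which driver a PCI device is bound to.
# SYSFS_PCI is a hypothetical override, mainly useful for testing.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

bound_driver() {
    local link="$SYSFS_PCI/$1/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. ".../drivers/vfio-pci".
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

`bound_driver 0000:26:00.0` would be expected to print `vfio-pci` while the guest is running and `nvidia` after a successful teardown; `none` means no driver is bound.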