pedro00dk/nvidia-exec

nvx still on after quitting and stopping

NicBOMB opened this issue · 13 comments

nvx seems to hang immediately after I close glxgears and enter my password for cleanup.
This was on EndeavourOS with nvidia dkms 515.43.04 and the linux zen kernel, under X11.
I have not yet tested other apps, games, or Wayland.

$ nvx start glxgears
# turn on gpu
-- pci rescan
[sudo] password for user: 
-- pci "PCI bridge - 6th-10th Gen Core Processor PCIe Controller (x16)" -> 0000:00:01.0
   -- pci power control on
   -- device enable "VGA compatible controller - TU116M [GeForce GTX 1660 Ti Mobile]" -> 0000:01:00.0
# load modules
   -- module nvidia
   -- module nvidia_uvm
   -- module nvidia_modeset
   -- module nvidia_drm
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
750 frames in 5.0 seconds = 149.992 FPS
720 frames in 5.0 seconds = 143.965 FPS
720 frames in 5.0 seconds = 143.977 FPS
X connection to :0 broken (explicit kill or server shutdown).
# kill processes
-- no processes found
# unload modules
-- module nvidia_drm
[sudo] password for user: 
-- module nvidia_modeset
-- module nvidia_uvm
-- module nvidia
# turn off
-- pci "PCI bridge - 6th-10th Gen Core Processor PCIe Controller (x16)" -> 0000:00:01.0
   -- device remove "VGA compatible controller - TU116M [GeForce GTX 1660 Ti Mobile]" -> 0000:01:00.0

At this point, nvx status replies on and

$ nvx off-kill
# kill processes
-- kill process nvidia-sm -> 17757
# unload modules
-- module nvidia_drm
[sudo] password for user: 
-- module nvidia_modeset
-- module nvidia_uvm
-- module nvidia
# turn off
-- pci "PCI bridge - 6th-10th Gen Core Processor PCIe Controller (x16)" -> 0000:00:01.0
   -- device remove "VGA compatible controller - TU116M [GeForce GTX 1660 Ti Mobile]" -> 0000:01:00.0
tee: '/sys/bus/pci/devices/0000:01:00.0/remove': Permission denied
   -- power control auto

nvx status still says on after running nvx off and nvx kill.

Hi @NicBOMB,
Could you try reproducing that setup again?
Sometimes other processes may detect that the Nvidia drivers are enabled and start using them.
Once you close a program started with nvx start (and it hangs), please share the output of the nvx ps command.
That will help me debug.

nvx ps prints nothing after I close the glxgears window and enter a password for device removal.
However, nvx psx prints the following:

$ nvx psx
user       4056  0.0  0.0   7848  4372 pts/1    S+   14:24   0:00 /bin/bash /usr/bin/nvx start glxgears

Restarting is also failing with the repeated message:

shutdown[1]: Waiting for process: 4888 (tee)

Yeah, that tee process is nvx trying to turn off the gpu.

Could you also post the output of these two commands in the same situation?
lsof /dev/nvidia* and sudo lsof /dev/nvidia*

$ lsof /dev/nvidia*; sudo lsof /dev/nvidia*
[sudo] password for user: 
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
      Output information may be incomplete.
$ lsof /dev/nvidia*
$ sudo lsof /dev/nvidia*
lsof: WARNING: can't stat() fuse.portal file system /run/user/1000/doc
      Output information may be incomplete.

ni-ka commented

Hey @pedro00dk and @NicBOMB

I am seeing the same behavior, but only for the past few days (I run Manjaro stable, so some updates are delayed). The last update, 4 days ago, brought kernel 5.18 / gnome 42.2-1 / mesa 22.1.1-2 / nvidia 515.48.07-2.

I also tested booting kernel 5.17 and saw the same behavior, so the kernel doesn't seem to be the cause.

BTW I am running on wayland.

ni-ka commented

My bad, it was not related to the updates. It was caused by me adding options nvidia-drm modeset=1 to /etc/modprobe.d/nvidia.conf while I was experimenting with connecting an external monitor under wayland (which works nicely, by the way). However, it causes the behavior seen above on my system. Maybe worth checking @pedro00dk @NicBOMB ?

You were right, I had that in my nvidia.conf and removing it resolved the issue. I actually didn't add it to that file myself though, it was envycontrol, which I had installed previously. It specifically added a prebuilt string to my conf. I would have removed it before if I had known it was there. Good catch @ni-ka . Now I'm wondering what else was configured tbh. I will stick to optimus manager and nvx for now. EnvyControl appears to be incompatible, but nvx is better for gpu activation and deactivation automation, whereas envycontrol is a switch.
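
For anyone else checking whether a tool left this option behind, here is a minimal sketch. The function name find_nvidia_modeset is made up for illustration, and /etc/modprobe.d is only assumed to be the relevant directory (the path mentioned above); some distributions also read other modprobe config directories.

```shell
# Hypothetical helper (not part of nvx): list modprobe config lines that
# enable nvidia-drm modesetting. The directory defaults to /etc/modprobe.d.
find_nvidia_modeset() {
    dir="${1:-/etc/modprobe.d}"
    # modprobe treats "-" and "_" in module names alike, so match both spellings.
    grep -rHE '^[[:space:]]*options[[:space:]]+nvidia[-_]drm[[:space:]].*modeset=1' "$dir" 2>/dev/null
}

find_nvidia_modeset || echo "no nvidia-drm modeset=1 option found"

# While nvidia_drm is loaded, the kernel exposes the live value here (Y or N).
cat /sys/module/nvidia_drm/parameters/modeset 2>/dev/null || echo "nvidia_drm is not loaded"
```

Each match is printed as file:line, which shows which config file needs editing.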

@ni-ka , I was able to reproduce the issue as well, nice find. I did not find a fix for it, though. Still, for some reason I was able to set the nvidia-drm modeset=1 module option directly within nvx while it loads the kernel modules.
That option seems to be a good thing to set. I saw in some nvidia docs that it is required for GBM-Wayland support (GBM is supported by all wayland compositors, unlike EGL streams).

I added that option and a new troubleshooting section to the README about this and other issues.

Some references:
https://download.nvidia.com/XFree86/Linux-x86_64/515.48.07/README/gbm.html
https://download.nvidia.com/XFree86/Linux-x86_64/515.48.07/README/kms.html

https://wiki.archlinux.org/title/wayland#Requirements

OK, it is not working anymore; I probably messed up my tests. I will remove the modeset=1 option for now and only leave the troubleshooting section in the README.

ni-ka commented

@pedro00dk I've investigated this a bit more. modeset=1 allows the external monitor to be used on wayland as well (tested in gnome; I also needed to activate modeset on the i915 module). However, if you use modeset=1 you have to log off and kill gdm to be able to unload nvidia, so it is currently not suitable for the start command.

Would it be possible to add a parameter that applies modeset=1 only when 'nvx on' is invoked, but not 'nvx start' (where it would prevent unloading)?
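
A rough sketch of what that could look like. This is not nvx's actual code, and nvidia_drm_load_cmd is a hypothetical name. It relies only on modprobe accepting module parameters on the command line, and it prints the command instead of running it. It would also only help if no file under /etc/modprobe.d sets the option already, since modprobe applies configured options as well.

```shell
# Hypothetical sketch, not nvx's actual code: pick nvidia_drm options per command.
# "on" gets modeset=1; anything else (such as "start") loads the module without it.
nvidia_drm_load_cmd() {
    case "$1" in
        on) echo "modprobe nvidia_drm modeset=1" ;;
        *)  echo "modprobe nvidia_drm" ;;
    esac
}

nvidia_drm_load_cmd on      # prints: modprobe nvidia_drm modeset=1
nvidia_drm_load_cmd start   # prints: modprobe nvidia_drm
```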

Loading nvidia_drm without modeset=1 causes heavy screen tearing for me. But loading it with modeset=1 prevents the kernel modules from unloading and freezes Xorg video output :/

Edit:
Actually I've fixed this one just now. Loading and unloading with modeset=1 works now. And no screen tearing. Just added this file:

$ cat /etc/X11/xorg.conf.d/01-autoadd.conf 
Section "ServerFlags"
        Option "AutoAddGPU" "off"
EndSection

to stop Xorg from automatically adding the newly scanned nvidia gpu.

ni-ka commented

I am using modeset=1 as well, but under gnome wayland these days. I have mostly been using my laptop stationary with an HDMI monitor.