openSUSE/SUSEPrime

Query: PCI Express Runtime Power Management and spurious PME interrupts

onelittlehope opened this issue · 7 comments

With prime-select nvidia enabled, I was getting intermittent freezes of 1 to 1.5 s on my laptop: the entire X session paused, including the mouse, while an ssh session open at the time stayed responsive. Each time the freeze occurred, a kernel message like the following was logged to the systemd journal:

pcieport 0000:00:01.0: PME: Spurious native interrupt!
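
To correlate freezes with this message, its occurrences in the kernel log can be counted. The following is a minimal sketch, not something taken from SUSEPrime: the helper name count_spurious_pme is hypothetical, and it reads log text from standard input so that it works with any source of kernel messages.

```shell
#!/bin/sh
# Hypothetical helper: count "Spurious native interrupt" PME messages
# in kernel log text read from standard input.
count_spurious_pme() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when nothing matches, so "|| true" keeps the helper's status at 0.
    grep -c 'PME: Spurious native interrupt' || true
}

# Usage (assumes systemd's journalctl; -k selects kernel messages,
# -b restricts the output to the current boot):
#   journalctl -k -b | count_spurious_pme
```

A count that rises each time the session freezes supports the link between the two.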

I mitigated the freezing issue by commenting out the following lines in the SUSEPrime udev rules:
https://github.com/openSUSE/SUSEPrime/blob/master/90-nvidia-udev-pm-G05.rules#L10-L16

The above lines seem to enable runtime power management when an NVIDIA card is in use.

Is that correct behaviour?

If a choice has been made to use the discrete card, then its performance should trump its power usage, especially if enabling runtime power management risks breaking behaviour.
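
Whether runtime power management is actually in effect for a PCI device can be read from sysfs. The sketch below is a minimal example under stated assumptions: the helper name pci_runtime_pm and its optional sysfs-root argument are hypothetical (the second argument exists only so the helper can be tried against a test directory), and the address used in the usage line is the PCIe port from the log message above, not necessarily the GPU itself.

```shell
#!/bin/sh
# Hypothetical helper: print the runtime PM policy and state of one PCI device.
# $1 = PCI address (e.g. 0000:00:01.0), $2 = sysfs root (defaults to /sys).
pci_runtime_pm() {
    dev_dir="${2:-/sys}/bus/pci/devices/$1/power"
    # control:        "auto" = runtime PM allowed, "on" = device kept powered
    # runtime_status: current state, e.g. "active" or "suspended"
    printf '%s control=%s status=%s\n' "$1" \
        "$(cat "$dev_dir/control")" "$(cat "$dev_dir/runtime_status")"
}

# Usage:
#   pci_runtime_pm 0000:00:01.0
```

A control value of "auto" means the kernel may suspend the device at runtime.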

Hmm. This should only be relevant in intel mode when using PRIME Render Offload. In that case it is needed for NVIDIA power-off support (Turing GPUs and later).

If I understood you correctly, then there should be some logic in the prime-select.sh script, or some manual instruction, that copies the 90-nvidia-udev-pm-G05.rules file to either the /etc/udev/rules.d/ or /usr/lib/udev/rules.d/ folder when a user runs prime-select intel or prime-select intel2.

Is that correct?

If so, I can raise a bug report against the suse-prime package in openSUSE since currently they copy the 90-nvidia-udev-pm-G05.rules file into the /usr/lib/udev/rules.d/ folder as part of the RPM package install.

Line 76 in the package spec: (https://build.opensuse.org/package/view_file/openSUSE:Factory/suse-prime/suse-prime.spec?expand=1):

install -m 0644 90-nvidia-udev-pm-G05.rules %{buildroot}/usr/lib/udev/rules.d

I'll ask them not to install the file, which will prevent any freezing issues when someone uses prime-select nvidia.

Regarding SUSEPrime itself, would it be possible to add some logic to prime-select.sh to copy the 90-nvidia-udev-pm-G05.rules file to either the /etc/udev/rules.d/ or /usr/lib/udev/rules.d/ folder as part of running prime-select intel or prime-select intel2, or is this an optional step for users?

Makes sense, but it would need more changes. In particular, dracut would need to be run afterwards and the machine rebooted. :-(

I would like to ask you to check first whether a change to

options nvidia NVreg_DynamicPowerManagement=0x01

in /etc/modprobe.d/09-nvidia-modprobe-pm-G05.conf already fixes this issue. After changing this, you need to run "mkinitrd" or "dracut -f" and reboot the machine to make sure that the nvidia module gets loaded with the changed parameter.
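
The steps above can be sketched as a small shell helper. This is an illustration only: the name set_nvidia_dpm is hypothetical, the in-place edit assumes GNU sed, and the file is assumed to already contain an NVreg_DynamicPowerManagement= assignment.

```shell
#!/bin/sh
# Hypothetical helper: set NVreg_DynamicPowerManagement in a modprobe.d file.
# $1 = new value (e.g. 0x01), $2 = path to the .conf file.
# Assumes GNU sed (-i edits the file in place).
set_nvidia_dpm() {
    sed -i "s/NVreg_DynamicPowerManagement=[0-9A-Za-z]*/NVreg_DynamicPowerManagement=$1/" "$2"
}

# Usage, as root (path and commands as named in the thread):
#   set_nvidia_dpm 0x01 /etc/modprobe.d/09-nvidia-modprobe-pm-G05.conf
#   dracut -f    # rebuild the initrd so the new option is used at boot
#   reboot
```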

The suse-prime openSUSE package installs a file called /etc/modprobe.d/09-nvidia-modprobe-pm-G05.conf which has the following contents:

options nvidia NVreg_DynamicPowerManagement=0x02

I've:

  • changed NVreg_DynamicPowerManagement to 0x1 in the /etc/modprobe.d/09-nvidia-modprobe-pm-G05.conf file
  • reverted my changes to /usr/lib/udev/rules.d/90-nvidia-udev-pm-G05.rules
  • run mkinitrd, which seems to call /usr/bin/dracut --logfile /var/log/YaST2/mkinitrd.log --force /boot/initrd-5.6.2-1-default 5.6.2-1-default
  • and confirmed that prime-select nvidia was selected before rebooting the laptop

I can confirm that the freezing issue does not happen any more.
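
After the reboot, it can be worth confirming that the running driver really picked up the new value. The sketch below assumes that the NVIDIA driver lists its registry keys in /proc/driver/nvidia/params as "Name: value" lines, with the value shown in decimal; both the file layout and the helper name nvidia_dpm_loaded are assumptions, not something stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the DynamicPowerManagement value the loaded
# nvidia module is using.
# $1 = params file (defaults to /proc/driver/nvidia/params, an assumed path).
nvidia_dpm_loaded() {
    # Match the exact key so that longer names sharing the same prefix
    # are not printed as well.
    sed -n 's/^DynamicPowerManagement: *//p' "${1:-/proc/driver/nvidia/params}"
}

# Usage:
#   nvidia_dpm_loaded
```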

Judging by the dynamicpowermanagement.html link, I think that with NVreg_DynamicPowerManagement=0x02 the NVIDIA driver was incorrectly powering down the GPU even while it was driving a display, causing a freeze when applications next tried to use it.

As such, I am happy to leave the udev rules in place and use NVreg_DynamicPowerManagement=0x1 with the prime-select nvidia profile.

It would be good if options nvidia NVreg_DynamicPowerManagement=0x02 could be used with the prime-select intel profile since this would cause the GPU to be actually powered off instead of going into a low power state.
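
For reference, the values discussed here can be summarised in a small lookup. The descriptions are a paraphrase of the NVIDIA driver README as recalled, not a quotation, and the helper name describe_nvidia_dpm is hypothetical; the README for the installed driver version remains the authority.

```shell
#!/bin/sh
# Hypothetical helper: describe an NVreg_DynamicPowerManagement value.
# Descriptions paraphrase the NVIDIA README; verify against your driver version.
describe_nvidia_dpm() {
    case "$1" in
        0x00|0x0|0) echo "disabled: no runtime D3 power management" ;;
        0x01|0x1|1) echo "coarse-grained: GPU may power down only when no application is using it" ;;
        0x02|0x2|2) echo "fine-grained: as coarse-grained, and may also power down while idle with applications running (Turing or later)" ;;
        *)          echo "unknown value: $1" ;;
    esac
}

# Usage:
#   describe_nvidia_dpm 0x02
```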

Are you already using a Turing GPU?

The output of inxi -G reports the following and so yes I am using a Turing GPU:

Graphics:  Device-1: Intel UHD Graphics 630 driver: i915 v: kernel 
           Device-2: NVIDIA TU106M [GeForce RTX 2060 Mobile] driver: nvidia v: 440.82 
           Display: server: X.org 1.20.8 driver: modesetting,nvidia unloaded: intel tty: 171x45 
           Message: Advanced graphics data unavailable in console for root. 

Thanks for the detailed information and explanation! Very helpful! So indeed we probably have the first user owning an NVIDIA Turing GPU in his laptop and using suse-prime. :-)

Changing the NVreg_DynamicPowerManagement value in prime-select would require a rebuild of the initrd via dracut. Honestly, I'm rather reluctant to open such a can of worms in suse-prime.

I'm going to change the default value to 0x01, since this appears to be a safe choice for now for everyone, in both intel and nvidia mode.

Fixed in current git and release 0.7.11.