terminatorul/NvStrapsReBar

BSOD 0x00000119 in dxgmms2.sys when resuming from sleep

Closed this issue · 54 comments

System

  • Motherboard: Gigabyte B660 DS3H AX DDR4
  • BIOS Version: F28
  • GPU: KFA 2070 Super
  • CSM is turned off. Make sure to confirm this in the BIOS and not with GPU-Z or similar since it can be inaccurate
  • 4G decoding is enabled. Make sure to confirm this in the BIOS and not with GPU-Z or similar since it can be inaccurate
  • UEFIPatch is applied (see Using UEFIPatch for more information). On some motherboards DSDT Patching is also needed
  • I have read Common issues (and fixes)

Description
When resuming from sleep I always receive BSODs like this one caused in dxgmms2.sys:

012924-17203-01.dmp	29/01/2024 17:14:12		0x00000119	00000000`00000005	ffffe60c`e90ee000	ffffe60c`e91a6820	00000000`0000b2db	watchdog.sys	watchdog.sys+5685	Watchdog Driver	Microsoft® Windows® Operating System	Microsoft Corporation	10.0.22621.2506 (WinBuild.160101.0800)	x64	ntoskrnl.exe+416bc0					C:\WINDOWS\Minidump\012924-17203-01.dmp	20	15	22621	4,997,260	29/01/2024 17:14:52	

Anything to try out?

Sorry I do not know about this problem.

But I see from the base project ReBarUEFI that resume from suspend is sometimes a problem when modding the UEFI firmware image to add ReBAR to the motherboard. @xCuri0 did you find out more about such problems ?

Can you enable ReBAR on your motherboard, then check and see if input value 65 works for you ? (in newer versions option 65 is called System default and has the value 0)

It is the least intrusive option and does not change the PCI configuration on the motherboard side. But this option is not always enough and does not work for all users.

It works, as in I can resume from sleep without a BSOD, but ReBAR is also disabled

Sorry, I don't know what else to do about this. Did you also enable ReBAR on the motherboard in UEFI setup ?

@saveli try changing on/off any BIOS settings labeled save PCI configuration or similar.

Sleep works on ReBarUEFI for most users. I think it's related to the BIOS setting I mentioned not correctly saving the extended configuration; disabling it probably fixes it because the OS handles it then. But this is a B660 board, so I don't think that's the issue here

@Cancretto @UnidentifiedTag can you try sleep ?

If no one can get it working, then it appears the NVIDIA GPU needs its straps set up again. It's possible to hook S3 resume from a UEFI DXE driver using a boot script, though, so it can be fixed.

https://uefi.org/specs/PI/1.8/V5_S3_Resume.html

@sid27 @pexcfequinnet does sleep work on laptops with this, or not ?

@xCuri0 just done testing:
Sleep currently does not work; it threw a BSOD with stop code VIDEO_SCHEDULER_INTERNAL_ERROR.
Setting the value to 65 using the program does make it work (ReBar verified enabled with GPU-Z with 256MB BAR).

There is also this EFI error if I set the value to 12, which is 4GB.
image_1

Will be updated further if anything new is found.

@pexcfequinnet to make S3 resume work, the module will need to add an S3 resume script which sets up the straps and the resizable BAR.

I thought the NVIDIA driver might save that stuff itself but it doesn't.

@xCuri0

If no one can get it working, then it appears the NVIDIA GPU needs its straps set up again. It's possible to hook S3 resume from a UEFI DXE driver using a boot script, though, so it can be fixed.

https://uefi.org/specs/PI/1.8/V5_S3_Resume.html

Do I understand correctly after suspend the GPU straps bits get reset, but the PCI configuration space does not ?
And the UEFI DXE driver needs to re-configure GPU straps during resume ?

@pexcfequinnet

Setting the value to 65 using the program does make it work (ReBar verified enabled with GPU-Z with 256MB BAR).

256MB BAR means ReBAR disabled, right ? So option 65 just didn't work ?

@terminatorul

Sorry, I don't know what else to do about this. Did you also enable ReBAR on the motherboard in UEFI setup ?

Yes, above 4g and bar are enabled. Also did try different aperture sizes, doesn't matter.

@xCuri0

@saveli try changing on/off any BIOS settings labeled save PCI configuration or similar.

The only settings I found that relate to BIOS / OS handover are USB hand-off and native ASPM. Regarding PCI there's only link speed, nothing else. So nothing on that side.

@terminatorul there's this when I set the value to 0 (Disabled)
image_3

and this when I set the value to 65
image_4

Sleep works with both values above.

@xCuri0

If no one can get it working, then it appears the NVIDIA GPU needs its straps set up again. It's possible to hook S3 resume from a UEFI DXE driver using a boot script, though, so it can be fixed.
https://uefi.org/specs/PI/1.8/V5_S3_Resume.html

Do I understand correctly after suspend the GPU straps bits get reset, but the PCI configuration space does not ? And the UEFI DXE driver needs to re-configure GPU straps during resume ?

Yes, this is the issue. You probably won't need to touch PCI configuration space on resume, because the OS or BIOS will restore it.

DXE drivers do not run on resume, but EDK2 provides an S3 resume script API, which I linked before, that allows you to do various operations such as memory and PCIe read/write operations, making it usable for setting up the straps again.

DXE drivers do not run on resume, but EDK2 provides an S3 resume script API, which I linked before, that allows you to do various operations such as memory and PCIe read/write operations, making it usable for setting up the straps again.

Thank you for this, I definitely would not have known that.

@saveli
This requires more development, to add resume support to the driver, but I am still busy trying to remove the hard-coded values ... so I can't work on the new script now, sorry. Plus, if I cannot reproduce this problem on my system, I am not sure I can release a script on GitHub to do this ... it's just a risk for other users. But I should be able to reproduce the problem as well, since more people reported it

If I can help, let me know

Warning added to main README page about possible crash on resume...

Warning added to main README page about possible crash on resume...

Worth! won’t go to sleep then 😂

@xCuri0 Do you know if the boot script runs after PCI configuration is restored ?

Do I still need to save / restore the bridge configuration and the GPU BAR0 address in the boot script ?

@terminatorul it runs before to my knowledge.

I think everything besides the resizable BAR capability has to be redone; that will most likely be restored by the OS, though I'm not 100% sure about it

I had the same problem, maybe it has to do with the BIOS having two FFS modules?

Hi @xCuri0 , can you have a look at this line for my initial attempt please:

status = gBS->CreateEventEx(EVT_NOTIFY_SIGNAL, TPL_APPLICATION, &PreExitBootServices, NULL, &gEfiEventBeforeExitBootServicesGuid, &eventBeforeExitBootServices);

Do you know why CreateEventEx() would return EFI_STATUS: INVALID_PARAMETER ?

image

Is it ok to simply add this to the .inf file:

?

@terminatorul I think that event was only added recently, in 2022, at least it looks like it to me.

I think it's best we stick to UEFI 2.0 functions to maximize compatibility.

You can achieve the same with an ExitBootServices hook, there are many examples for this.

Though I'm not sure if this is needed, because can't you just use the BAR address in the configuration instead of reading it ?

I'm not 100% sure about this but you might find out that the bridge needs to be reconfigured again.

It will again get reconfigured by the OS when it wakes up so you don't need to restore any values.

From what I can find it seems standard for the OS to restore PCI registers by itself. So all that's needed is

  1. Configuring the bridge and BAR0 using the same BAR0 address found in configuration (reading from ExitBootServices is unnecessary)
  2. Configure strap

The OS will handle restoring the resizable BAR size.

@xCuri0

Yes, that looks complicated. The boot script has opcodes for writing PCI configuration space, using the PCI address (bus, device, function). So I have to wonder how that is done if PCI is not configured ? I mean, at least the bus numbers must be allocated before the boot script can run those opcodes. So does it make sense to run the boot script after PCI bus numbers are allocated, but before other PCI configuration is done ?

It is possible for the GPU base address 0 (BAR0) to be different from what I have in the configuration variable. Simply enabling ReBAR can already change the allocation the firmware does, and so can adding a new GPU or other device, or just changing other options in UEFI Setup ... Also Linux can run with the pci=realloc option, but I suppose it should be smart enough to leave the firmware allocations alone

In the worst case scenario I need to make the .ffs a runtime driver, and add a custom function (written in C) to the boot script. Do you know how I can change the DXE driver from a boot services driver to a boot + runtime services driver ?

I read that simply using EXIT_BOOT_SERVICES (and not BEFORE_EXIT_BOOT_SERVICES) runs the risk that other components, like my boot script, have already transitioned to runtime services and are no longer in boot services mode. Do you know something about that ? If I hook ExitBootServices, do I have the same problem ?

I guess there are other ways like using the end of PCI enumeration to write the boot script

I just don't know what to expect ... so I cannot make the right decisions

@terminatorul You should be fine using the same values in the NVRAM variable configuration for the S3 script, they work for DXE phase so they will work for S3 phase too.

It doesn't matter if it doesn't match the ones assigned by the BIOS, because when the OS resumes (after the S3 script has run) they will be restored anyway.

Don't mess with runtime drivers or an event/ExitBootServices hook; it's completely unnecessary.

Remember I skipped the bridge configuration for now. So I can no longer pick and choose whatever base address works for me; I must get the real one

@terminatorul the bridge isn't initialized in S3 resume before the OS loads, so you will have to configure it yourself anyways.

@terminatorul I think that event was only added recently, in 2022, at least it looks like it to me.

I think it's best we stick to UEFI 2.0 functions to maximize compatibility.

Thank you. I was wondering about that, but never got to check it properly

@terminatorul the bridge isn't initialized in S3 resume before the OS loads, so you will have to configure it yourself anyways.

Do you think I have to walk the tree of PCI bridges to assign the bus numbers starting from the root bridge / host bridge ?

@terminatorul no you don't have to do anything like that.

The only bridge that needs configuring is the same one that gets configured by the DXE.

So the bus numbers are already allocated in S3 boot, but not the BARs ?

@terminatorul it's exactly the same as the DXE phase, so you need to do everything you are doing currently.

I don't think bus numbers are assigned, but the only thing that needs bus numbers assigned is the GPU parent bridge, which the DXE driver currently assigns.

Ok. I still want to blindly try to access BAR0, to make sure it's not going to work. But I understand it should not work because PCI is not configured.

I removed the assignment for the bridge PCI bus number now (Xelafic suggested I could do that a long time ago). And now I actually check that the bridge secondary bus number is pre-assigned and matches the GPU bus number, otherwise I report a code in my StatusVar ... I know some users have input the wrong bridge, when it was still hard-coded

@terminatorul If that's the case I guess bus numbers might be assigned in S3 resume too

I switched to the gEfiEventReadyToBootGuid event group, which is old enough, and I still get the same EFI_STATUS code for INVALID PARAMETER. Now something is off

@Felty2562

i had the same problem, maybe has to do with bios having two ffs mpdules?

The real problem is that during S3 sleep the GPU cuts its power or resets its hardware configuration as if it lost power. And the straps bits written by the DXE driver to enable ReBAR are then reset after resume.

In other words, after resume ReBAR is suddenly disabled, even though it was enabled at boot. The fix is to re-write the straps bits during resume, to match the values written during boot. But this is more difficult to do during S3 resume, and the programming documentation is unclear.

I suspect everyone using NvStrapsReBar has this problem.

@xCuri0 Every time I use EFI_S3_SAVE_STATE_PROTOCOL.Write(), my board no longer POSTs and I need to recover it with Q-Flash Plus.

I am sorry, I don't think I can implement this :(

Are there some other conditions I need to meet before using the protocol ? Does it matter how I get the protocol pointer ?

First I copied your method and used gBS->LocateHandleBuffer() + gBS->OpenProtocol(). And then I also tried with gBS->LocateProtocol().

To be clear, I do not get an EFI error result (as if the call should work), and I cannot even try to suspend (sleep) my computer, because the problem appears during POST at cold boot.

Thank you !

After trying again I found an error in my code: I forgot the double pointer when I query the interface (protocol) pointer. I guess I am too used to the type safety from C++, which is not the same in C.

Now my system resets right when it enters sleep.

I still have to configure the PCI bar in the S3 boot script

I was losing hope already, but tried again anyway. And now it appears to work. I can sleep and resume with no error, and keep the large BAR size. However my development system does not stay in sleep and suddenly resumes with no user input ... I suspect this is a separate issue with my system, though.

@saveli I have a very crude implementation of this fix for resume on the bleed branch. Do you think you can build it, flash it again and test to see if it fixes the BSOD ?

You can find the culprit using powercfg /lastwake. From my experience, for "automatic" wakes it is always the NIC - disallow it from waking the system from sleep under device properties in Device Manager.

I can try flashing later, when I have time. (Maybe it works on your side until then)

@terminatorul I recommend you remove the ExitBootServices hook and instead use the BAR0 value from the configuration like usual. No reason it shouldn't work

Because using ExitBootServices hook can set off antivirus and stuff which might cause issues for some users

I guess the exact settings that cold-booted the machine can also be used to resume from sleep.

@xCuri0 any reason you used gBS->LocateHandleBuffer() + gBS->OpenProtocol(), instead of a single gBS->LocateProtocol() ?

Is the latter not well supported in your tests or user reports ?

@Cancretto @UnidentifiedTag can you try sleep ?

Sleep works

@terminatorul I think both functions work the same, didn't choose it for any particular reason.

@saveli

You can find the culprit using powercfg /lastwake. From my experience, for "automatic" wakes it is always the NIC - disallow it to wake it from sleep under device properties / device manager.

Thanks. For my case it turns out it is ... my monitor ?

image

The NVIDIA USB-C controller (on the RTX 2080 Ti) must be the graphics card's USB-C output, which is driving one of my monitors, an ASUS MG278Q (1440p@144Hz)

And now also the NIC, like you said:

image

New commit on branch bleed with option to enable / disable the S3 Resume script, default enabled.

Maybe I can make a release with it one of these days

New release v0.3 adds support for the S3 Resume Script, enabled by default

Works so far - can resume from sleep and GPU-Z shows ReBAR enabled
Thanks! 👍

It was a bit weird that I did not have to edit addresses in the PCI header 😬