firelzrd/bore-scheduler

The Witcher 3 hangs

Kron4ek opened this issue · 14 comments

The Witcher 3 (v1.31) running via Wine hangs with the BORE scheduler. Sometimes it happens sooner, sometimes later, but usually within 1 minute.

When it hangs, it also seems to prevent new processes from working correctly. For example, I tried running free and pgrep after the game hung, and they also hung indefinitely until I switched to another TTY and killed the game from there.

Disabling BORE via sysctl kernel.sched_bore=0 fixes the issue.
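For A/B comparisons like this one, it can help to record whether BORE is active before each run. The following is a minimal sketch, assuming the kernel.sched_bore sysctl named in this report is exposed at /proc/sys/kernel/sched_bore (the usual sysctl-to-procfs mapping). bore_state is a hypothetical helper name, not part of BORE.

```shell
#!/bin/sh
# Print the current value of kernel.sched_bore, or "absent" when the
# running kernel does not expose it (assumed: no BORE patch applied).
bore_state() {
    if [ -r /proc/sys/kernel/sched_bore ]; then
        cat /proc/sys/kernel/sched_bore
    else
        echo "absent"
    fi
}

# Example A/B run (needs root, so it is left commented out):
#   sudo sysctl -w kernel.sched_bore=0 && bore_state   # BORE off
#   sudo sysctl -w kernel.sched_bore=1 && bore_state   # BORE on
```

Logging the value next to each test result makes it easier to tell later which runs were made with BORE enabled.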

Here is a video of what it looks like: https://youtu.be/fQBmaU5yG2o

Let me know if I need to provide more info or try something.

My system:

CPU Intel Pentium G4620 3.7 GHz
Linux 6.4.7 with BORE 3.0.0 (compiled with Clang 15, Full LTO and O3)
Wine-Staging 8.13
DXVK 98f3887-git

Thank you so much for the report.
That hang looks like a serious showstopper. I'll work on it as a top priority.
It's interesting that when the program has hung, some related functionality is also affected, while other things like switching between TTYs still work. That may be a hint.
To make sure I got you correctly, can I ask you if the program hangs up "forever" or may it eventually come back if you wait?

Okay, I installed & played The Witcher 3 on my rig straight through for like 30 minutes without any problem.
Maybe there's some precondition (independent from the BORE scheduler itself) to reproduce the problem.
Maybe Wine-Staging, maybe DXVK, I don't know. I heard that the Linux 6.4 series recently suffered from a serious graphics issue.
Here's my setup. Please feel free to tell me when you have any ideas to share.

AMD Ryzen 7 4800U with Radeon Graphics
Linux 6.3.10 with CachyOS patchset, BORE 3.0.1, GCC 11.4.0
GNOME 3.30
Lutris-GE-Proton8-12
DXVK 1.10.3

To make sure I got you correctly, can I ask you if the program hangs up "forever" or may it eventually come back if you wait?

The first time, it recovered after about 30 seconds, but a few seconds later it froze again. I then waited for around 5 minutes and it still didn't unfreeze. Switching to another TTY and then back to the first one usually brings the game back, but it hangs again after a few seconds.

My findings so far are:

  • Using Lutris-GE-Proton8-12 and DXVK 1.10.3 does not help
  • Running the game without DXVK does not help
  • Disabling FSYNC and ESYNC does not help
  • Recompiling kernel with gcc and O2 does not help
  • Disabling hyper-threading (SMT) does not help and makes the problem even worse: in this case even switching TTYs does not work

I'll try to reproduce the issue on kernel 6.3.

Interesting...
I'll try 6.4.7-based kernel (built with Clang) and see if there's any difference.
Thanks for your cooperation.

Follow-up:
No hangup was observed with kernel 6.4.7 either.
Can you tell me what your graphics card is?

I tried 6.3.13 with BORE 3.0.1 and unfortunately the issue occurs there too. I also tried BORE 2.5.3 and BORE 2.4.2, and the issue persists. I also experience it on Linux 6.1.40 with BORE 2.5.3 (linux-cachyos-lts). I'll try even older BORE versions.

Can you tell me what your graphics card is?

Radeon RX 470.

OK, so I tested more BORE versions: 1.7.14, 2.0.1, 2.1.1 and 2.2.8. They all have this issue, which at least means it's not a regression. Other interesting findings:

  • When the game hangs, it maxes out all CPU threads. In my case it uses all 4 threads (400% CPU), and it continues to do so until I terminate it. Under normal conditions it uses only half of that.
  • I found that the issue is easier to reproduce when running the game on only 2 cores and with higher priority:
    $ nice -n -20 taskset -c 0,1 wine game.exe
    
  • As I mentioned earlier, the game prevents new processes from working when it hangs, usually only within the same TTY, but sometimes even switching TTYs breaks. When the game is limited to only 2 or 3 threads, however, new processes work fine.

The last point is especially weird. It looks as if the game process is treated as a realtime, non-preemptible process or something similar, even though it's certainly SCHED_NORMAL. It uses all the CPU time and new processes do not get any, at least that's how it looks. The issue is not reproducible with nice -n -20 stress -c 4.
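One way to double-check the SCHED_NORMAL observation for every thread of a process is to read the policy field from procfs. This is only a sketch, assuming a Linux /proc where policy is field 41 of /proc/PID/task/TID/stat; thread_policy and list_policies are hypothetical helper names.

```shell
#!/bin/sh
# Print the scheduling policy number of one thread.
# 0 = SCHED_NORMAL (SCHED_OTHER), 1 = SCHED_FIFO, 2 = SCHED_RR,
# 3 = SCHED_BATCH, 5 = SCHED_IDLE, 6 = SCHED_DEADLINE.
thread_policy() {
    tp_stat=$(cat "/proc/$1/task/$2/stat") || return 1
    # Drop "pid (comm) " so a comm containing spaces cannot shift the fields.
    tp_rest=${tp_stat##*) }
    # tp_rest now starts at field 3 (state), so field 41 is the 39th word.
    set -- $tp_rest
    echo "${39}"
}

# Print "TID POLICY" for every thread of the given PID.
list_policies() {
    for lp_dir in /proc/"$1"/task/*; do
        lp_tid=${lp_dir##*/}
        echo "$lp_tid $(thread_policy "$1" "$lp_tid")"
    done
}

# Example (the PID lookup is an assumption about how the game shows up):
#   list_policies "$(pgrep -o -f game.exe)"
```

If every line reports 0, the threads are ordinary SCHED_NORMAL tasks, and the starvation would have to come from how the fair scheduler treats them rather than from a realtime policy.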

Thank you for testing so many cases.

Radeon RX 470.

That's a Polaris 10 GPU, but since the issue also happens on older kernels, it can't be the 6.4-specific graphics issue that has been discussed recently.

In my past experience, this kind of "new processes can't run" symptom is usually seen when kernel threads are blocking other processes' access to resources such as I/O. For example:

  • The Btrfs transaction kthread is writing dirty pages
  • kswapd is trying to flush pages to disk

Your detailed analysis gives me an interesting insight into the problem.
With those facts in mind, I'll play around with it and hopefully find something.

Good news, I managed to reproduce the issue without the game. Running the sched_yield stressor of stress-ng prevents new processes from working when BORE is enabled; when BORE is disabled, the issue does not occur.

$ stress-ng -y 4

I'm not exactly sure, but I think The Witcher 3 is also calling sched_yield before it freezes and during the freeze. To max out the CPU threads and yield at the same time:

$ stress-ng -c 4 -y 4
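To put a number on "prevents new processes from working", one could time how long a trivial command takes to start and exit while the yield stressor runs. This is a rough sketch, assuming GNU date (for %N) and an installed stress-ng; probe_ms and probe_under_yield are hypothetical helper names.

```shell
#!/bin/sh
# Run a command and print how long it took from launch to exit, in ms.
probe_ms() {
    pm_start=$(date +%s%N)      # nanoseconds since the epoch (GNU date)
    "$@" >/dev/null 2>&1
    pm_end=$(date +%s%N)
    echo $(( (pm_end - pm_start) / 1000000 ))
}

# Start the yield stressor from this thread in the background and probe
# once per second. Only call this on a test machine: on an affected
# kernel the probes themselves may stall.
probe_under_yield() {
    py_workers=${1:-4}
    py_seconds=${2:-10}
    stress-ng -y "$py_workers" -t "${py_seconds}s" >/dev/null 2>&1 &
    py_pid=$!
    py_i=0
    while [ "$py_i" -lt "$py_seconds" ]; do
        echo "spawn latency: $(probe_ms true) ms"
        sleep 1
        py_i=$((py_i + 1))
    done
    wait "$py_pid"
}

# Example: probe_under_yield 4 10
```

Comparing the printed latencies with kernel.sched_bore set to 0 and to 1 would show whether process startup is being starved.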

That's nice. How about:
$ sudo sysctl -w kernel.sched_burst_smoothness_down=3

Will it still freeze?

Yes, still freezes.

Okay, that's a good hint. I've got an idea.
Let me come back with an experimental patch later.
Since I have a business meeting right now, it will probably take an hour or two until the patch arrives.
Thank you for the support. You're really helping.

Fixed. (v3.1.0)
Please try it and let me know what you think.

It is fixed indeed. I can't reproduce the issue on 3.1.0, either with the game or with stress-ng. Thank you.

Thank YOU very much for all the devoted cooperation :)