KSPModdingLibs/KSPCommunityFixes

Old PQS coroutines are never stopped, causing perf degradation over time

gotmachine opened this issue · 4 comments

Quoting @Gameslinx :

I found a fairly big problem yesterday relating to the PQS.UpdateSphere() coroutine starting without stopping the old one.
There should be at maximum two instances of the coroutine running at once. One for the ocean, and one for the terrain
However, whenever the PQS is restarted or set active, the coroutine is started again but if there was one running before this, it isn't stopped
So as an example, going from the Space Center to Kerbin will spawn an additional 2, making 4 of them running at once
Quicksaving and quickloading on the runway spawns 2 more each time, so I managed to rack up 14 of them running at once
image

Likely cause :

PQS.ResetSphere() is calling StopCoroutine("UpdateSphere").
This is the overload that takes a string arg for a named coroutine, but that only work if the coroutine was started using the matching string overload, which isn't the case, the coroutine is started in PQS.StartSphere() with the IEnumerator arg overload : StartCoroutine(UpdateSphere())

So, this is quite difficult to fix, but with some messing around it's at least possible to avoid all PQS.UpdateSphere() coroutine duplication cases. Eliminating non-duplicated but unneeded coroutines that are still running on other bodies than the main body is a lot more problematic, but they are eventually collected on scene switches.

Main fix involves changing the StartCoroutine() calls from the IEnumerator overload to the string overload in PQS.StartSphere() and PQS.ResetAndWait() methods. This allow the stock StopCoroutine() calls in ResetSphere() and ForceStart() to actually work, and allow us to place additional StopCoroutine() calls.

However, simply fixing the stock bug isn't enough because the stock PQS lifecycle isn't designed at all to ensure only one coroutine exists at a given time. Notably (but not limited to), it seems there is no guaranteed code path at all that could clean up the old coroutines when the PQS is restarted while it was already the active PQS. This notably mean that on scene switches, if the main body doesn't change, a new coroutine is created without stopping the old one. Trying to aggressively kill the coroutines (either from within the coroutine, or externally) when a PQS is not the main body anymore occasionally result in killing the actually needed coroutine. Such solutions seems too risky, there are too many corner cases and delays / order shenanigans in the PQS execution flow.

A poor man alternative is to always call StopCoroutine() prior to the StartCoroutine() call. This ensure there is no duplication happening, it cover the vast majority of cases and should be quite safe. There are still some occasional cases of the previous main body PQS coroutine being kept alive after leaving its SOI, but those will be eventually cleaned up on the next scene switch and PQS.UpdateSphere() has the good taste to skip a good chunk of the processing when the PQS isn't active, so the perf impact is relatively limited.

Hmm, could you inject code into the coroutine itself that detects when it's not needed anymore and ends it?

That's what I meant with "Trying to aggressively kill the coroutines (either from within the coroutine, or externally) when a PQS is not the main body anymore occasionally result in killing the actually needed coroutine."

I haven't found a reliable way to "detect when it's not needed anymore". If I try to self-kill them when the PQS isn't supposed to be active anymore, there are a bunch of corner cases where if the PQS is reactivated latter, it doesn't go through a code path spawning a new coroutine. I don't remember everything I tried, but trying to self-kill the coroutine also face the issue that the coroutine is spawned before the PQS stuff is fully initialized, so you can also end up self-killing the newly spawned coroutine.

Everything I tried didn't work consistently so I gave up because without investigating too long, I have a feeling this would be a pretty deep rabbit hole.

In fact, I think that the main bug causing those coroutines to never be stopped has been around since forever without any functional side effect (although there are definitely occasional "PQS crash, restarting" entries popping up in the logs), so this was hiding the fact that there is a hole in the PQS lifecycle management . Maybe it's possible to patch that hole but I'm not willing to put my hands in this mess.

The committed patch takes care of the "leak" with minimal intervention, it solves the main problem of unrecoverable perf degradation over time. With the patch, worst case scenario (visiting multiple SOIs without a scene switch or reload in between) you have an extra update function consuming ~0.3 ms lying around. And this is not systematic, I'm not even sure it's reproducible without aggressively teleporting from one SOI to another like I was doing for testing.

This has shipped.