dotnet/runtime

Excessive GC time caused by ConditionalWeakTable entries

luithefirst opened this issue · 9 comments

I'm using a lot of ConditionalWeakTable entries to store computations and runtime data attached to some input data structure. It turned out that the total number of entries can get quite large (>500k) and that they cause disruptive GC stalls in the application. I was able to isolate the behavior in a small test project, which you can find here (using .NET Core 3.1.9): https://github.com/luithefirst/WeakTableGCStall

The procedure adds 10k weak table entries every second and continuously does some computation with a small data allocation that triggers a Gen0 collection every few hundred iterations. The time of the computation is measured and outliers are reported. The number of outliers matches the value of the "gen-0-gc-count" EventCounter quite well. The magnitude of the outliers increases linearly with the number of weak table entries:

| Entries | Outlier (GC time) |
|---------|-------------------|
| 100k    | 2.5 ms            |
| 200k    | 3.0 ms            |
| 300k    | 4.4 ms            |
| 400k    | 5.9 ms            |
| 500k    | 7.4 ms            |
| 600k    | 8.8 ms            |
| 700k    | 10.5 ms           |
The actual system does real-time rendering using the Aardvark Platform (https://github.com/aardvark-platform), where we have an attribute grammar system that caches attributes using a weak table. We also use it in FSharp.Data.Adaptive (https://github.com/fsprojects/FSharp.Data.Adaptive) to attach callbacks to adaptive computations. In an application with a huge scene graph there are random frame time spikes caused by GC stalls due to the large number of weak entries. In a scene using about 500k such entries and a lot of other data, the frame time is typically 10ms (i7 8700K + GTX 1070 Ti) / 4ms (i7 4700K + RTX 2080 Ti) when only moving the camera (no heavy computations or new entries), interrupted by spikes of about 20ms / 40ms.
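To illustrate the caching pattern in question (a simplified sketch with hypothetical names, not the actual Aardvark or FSharp.Data.Adaptive code): computed data is attached to an input object through a CWT so that it dies together with the input, at the price of one dependent handle per entry that the GC must scan.

```csharp
using System;
using System.Runtime.CompilerServices;

// Simplified sketch of the attribute-caching pattern described above.
class AttributeCache<TNode, TAttribute>
    where TNode : class
    where TAttribute : class
{
    // Each entry is a dependent handle: the attribute stays alive as long as
    // the node does, and is collected together with it.
    private readonly ConditionalWeakTable<TNode, TAttribute> _cache = new();

    public TAttribute GetOrCompute(TNode node, Func<TNode, TAttribute> compute)
        => _cache.GetValue(node, n => compute(n));
}
```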

With this new insight, I will try to reduce our usage of ConditionalWeakTable, but it would be great if optimizations could be made to its internal implementation. I'm looking forward to your input on this issue.


it would be great if this showed the actual GC pause instead of just the count. you could either collect a trace, which would tell you definitively how long the GCs are, or use the new GetGCMemoryInfo API in 5.0, which shows the last ephemeral GC pause (you can see examples of using it in its tests).

Thanks, the .NET 5 API is very helpful. I've added a GetGCMemoryInfo query to the outlier iterations. After a short startup time, there is new GC info available with basically every outlier. Shortly before reaching 1 million entries, the program output looks like: `Update Time Outlier (Iteration=2662685): 13.8ms (x747, Mean=18us) GCPause: 13.8ms GCIndex=7283`, with about 70% of the time spent in GC.
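For reference, correlating an outlier with the last ephemeral GC can be done roughly like this (a sketch with hypothetical names; the exact reporting code in the test project may differ). `GC.GetGCMemoryInfo(GCKind.Ephemeral)` requires .NET 5 or later:

```csharp
using System;

static class GcPauseReporter
{
    static long _lastGcIndex = -1;

    public static void ReportOutlier(long iteration, double updateMs)
    {
        // Query the most recent ephemeral (Gen0/Gen1) collection.
        GCMemoryInfo info = GC.GetGCMemoryInfo(GCKind.Ephemeral);
        if (info.Index == _lastGcIndex)
            return; // no new ephemeral GC since the last reported outlier
        _lastGcIndex = info.Index;

        // Sum the pause durations of that collection.
        double pauseMs = 0;
        foreach (TimeSpan pause in info.PauseDurations)
            pauseMs += pause.TotalMilliseconds;

        Console.WriteLine(
            $"Update Time Outlier (Iteration={iteration}): {updateMs:F1}ms " +
            $"GCPause: {pauseMs:F1}ms GCIndex={info.Index}");
    }
}
```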

You can see a huge difference, with everything running absolutely smoothly, by commenting out this line: https://github.com/luithefirst/WeakTableGCStall/blob/master/WeakTableGCStall/Program.cs#L37

thanks for adding the pause time; and thanks @cshung for verifying the behavior.

so I took a brief look. you are hitting basically the worst usage you could get with CWT. there are 2 factors at play here -

  1. your GCs occur very frequently because survival is very low - you are allocating mostly temporary objects; only 0.5% survived with your list. so we are at the minimal gen0 budget, which means GCs will be triggered very often.

  2. the majority of each GC's cost is going through the CWT. so as the # of elements in your CWT increases, this cost increases proportionally - when you get to almost 1 million elements in your CWT, it will take a while.

essentially this is a test of going through a CWT, because the program doesn't really do much else at all (gen0 allocations are very cheap when they are almost pure temporary objects). so yes, the GC pauses will be your major cost.

Thanks for your detailed analysis! It confirms some assumptions I had about the GC needing to check CWT entries on every collection. Yes, the benchmark is the absolute worst case. In our real-time rendering application, the overall GC time is about 3%, but about every second there is a frame time spike when the GC runs. On a system with a fast GPU and a slower CPU, it becomes more noticeable. For some reason, the stalls are longer than what I could measure in the benchmark; maybe the number of CWTs is also a factor, but I did not validate the actual GC time with GetGCMemoryInfo. I have already removed some unnecessary CWT use cases, but the majority remain for now. I also did a quick test where I temporarily replaced a CWT with a Dictionary and could immediately cut the stall time in half. Unfortunately, the real solution of avoiding the CWT in this case (and most others) cannot be implemented that simply: the CWT allows for a very convenient implementation without "Manager" classes. It would be great if it did not come with this performance drawback.
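For illustration, the quick Dictionary test mentioned above amounts to swapping the auto-cleaning table for an explicitly managed one, roughly like this (a hypothetical helper, not the actual code):

```csharp
using System;
using System.Collections.Generic;

// A plain Dictionary removes the per-entry dependent handle the GC has to
// scan, but it strongly roots both keys and values, so a "Manager" must
// remove entries explicitly or they leak.
class ManagedCache<TKey, TValue> where TKey : class
{
    private readonly Dictionary<TKey, TValue> _map = new();

    public TValue GetOrCreate(TKey key, Func<TKey, TValue> create)
        => _map.TryGetValue(key, out var value) ? value : _map[key] = create(key);

    // Must be called when 'key' goes out of use; a CWT does this implicitly.
    public void Release(TKey key) => _map.Remove(key);
}
```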

It looks like this is something that cannot be solved by simply making the CWT checking within the GC faster; it requires a new concept for managing those entries. I hope you see our scenario as a valid use case and are able to reduce the overhead of CWTs in the future.

> the stalls are longer than what I could measure in the benchmark; maybe the number of CWTs is also a factor, but I did not validate the actual GC time with GetGCMemoryInfo.

I would say validating the actual GC time with GetGCMemoryInfo would still be useful so you can confirm your suspicion. the only other thing I would add is that if your secondary objects have references, it could make the scanning a lot more expensive ('cause they can then make some primary objects live, and now we have to scan those again). so if it's feasible to make them less reference rich, that would help.
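To make the "reference rich" point concrete, this is roughly the difference being described (illustrative types only, not from any of the projects mentioned):

```csharp
using System.Collections.Generic;

// Reference-rich secondary value: the GC has to trace each field, and any
// primary objects reachable from here become live and must be scanned again.
class RichSecondary
{
    public List<object> Callbacks = new();
    public Dictionary<string, object> Attributes = new();
}

// Reference-poor secondary value: plain data gives the GC almost nothing
// to trace per entry.
class LeanSecondary
{
    public int Id;
    public double CachedResult;
}
```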

> I hope you see our scenario as a valid use case and are able to reduce the overhead of CWTs in the future.

it's definitely a valid use case. the good news is I already did a design for concurrently scanning dependent handles, 'cause I knew it would be a problem for some scenarios (when CWT was added I knew scanning dependent handles could be a real perf problem, 'cause it just made using dependent handles too easy, and DH scanning was just not something we had optimized). when we switch to scanning them concurrently, you won't have this problem anymore. the bad news is I can also tell you this will not happen in .NET 6, 'cause our .NET 6 schedule is already fully booked... sorry.... 😿 😿

I'm bumping this out of interest, since I'm guessing the planning for .NET 7 has probably started 😃

In the Datadog tracer, there are situations where a CWT would be useful, but we're not using it because of the known performance issues (not that we have observed them ourselves, but given that we ship as a library, we have to consider the worst-case scenario). I'm definitely looking forward to any improvements in this area.

Here is an update from my side.

I've just revisited my stress test on net8.0. The performance characteristics do not seem to have improved, and may have gotten slightly worse (I did not backtest with .NET 5).

I currently get 18ms GC stalls with 700k entries.