Kotlin/kotlinx.coroutines

Potentially harfmul SynchronizedObject usages on iOS platform

qwwdfsad opened this issue · 13 comments

We have a report in public Slack about very slow CMP application: https://kotlinlang.slack.com/archives/C0346LWVBJ4/p1732321550585009
Luckily, the user provided an instrument profile trace, and, it's visible that some huge chunks of work are spent in our synchronization.

Notably, a good chunk (41%) of the time between 02:01 and 02:10 is spent within
kfun:kotlinx.coroutines.internal.LimitedDispatcher.obtainTaskOrDeallocateWorker#internal -> kfun:kotlinx.atomicfu.locks.SynchronizedObject#lock(){}. It requires further investigation, but it is worth taking a look

Instruments file for the history: Hang iPhone SE CMP 1.7.0.trace.zip

Waiting for the specific coroutines version, as atomicfu got a lock upgrade recently

Will this issue be considered solved if LimitedDispatcher is rewritten to be lock-free, or are all usages of locks to be investigated?

Would be nice to understand the root of the issue (it seems like it's LimitedDispatcher peculiarity) and address it instead. I don't think avoiding locks on K/N is justified

I'll double-check, but after a brief refresher on the source code, I see no fault in LimitedDispatcher. Maybe that's the iOS priority inversion issue we've been hearing about. The lock in LimitedDispatcher is only taken for a short time, with no potential for spin-locking on it that I've noticed.

Might be the case with allocation-spamming lock either, I've asked the user about coroutines version

I'm the OP from that Slack conversation, and was asked to provide the details here. I was just able to generate the freeze yet again, and here are the set of Kotlin & related dependencies I'm using. To answer the primary question asked above, I'm using kotlinx-coroutines-core version 1.9.0:

Screenshot 2024-12-02 at 10 45 21 AM

As an update, I have updated my codebase to Kotlin 2.1 and Compose Multiplatform 1.7. Coroutines are still version 1.9.

The original genesis for this issue was my app freezing, but thankfully that has not happened. However, the phone is still getting hot, and profiling with Time Profiler shows large amounts of SynchronizedObject calls. Attached is an image of the profiling. I will provide the updated trace file in the Kotlin Slack.

Screenshot 2024-12-09 at 12 29 35 PM

Thanks! The culprit seems to be https://github.com/Kotlin/kotlinx-atomicfu/blob/bad63e743ac905a298386446a82a88225c2f71fc/atomicfu/src/nativeMain/kotlin/kotlinx/atomicfu/locks/Synchronized.kt#L13-L66. There is a work in progress to replace them with a more efficient implementation on the atomicfu side, so the problem is expected to be solved in future releases. It's not going to be very soon, though, as writing efficient mutex implementations takes a lot of effort.

It doesn't seem like there's evidence of coroutines using the mutex incorrectly, which was the original topic of this issue. Can this be closed?

I suppose, but a few follow up questions:

  • Might I be able to provide my app's codebase to Jetbrains privately so you could use it for testing? I feel like this all especially pushes the limits of atomicfu through extensive use of SharedFlow - required for a websocket. It could be useful to benchmark if the updates improve things.
  • Is there anything in the interim I can do to mitigate this? I can't go live with intermittent app freezes.
  • If this issue is closed, is there any place I can still track updates that will improve this functionality?

Thanks!

Is there anything in the interim I can do to mitigate this? I can't go live with intermittent app freezes.

I thought this no longer happened with newer Kotlin and Compose versions? Is there a reason not to upgrade?

Might I be able to provide my app's codebase to Jetbrains privately so you could use it for testing?

Yes, this would be excellent, thanks! You could share a link / an archive / whatever with me in the kotlinlang Slack (Dmitry Khalanskiy [JB]) or add me to your project on github. I'll share your project with other teams internally if you do.

If this issue is closed, is there any place I can still track updates that will improve this functionality?

Good point, I don't think there's a public Github issue in atomicfu for tracking this. Let's leave this one open, then, and I'll close it if we do open an atomicfu issue.

Thank you for the reply! I've sent you my app's codebase on the Koltinlang Slack.

Unfortunately, it is still happening with Compose 1.7 and Kotlin 2.1. I just profiled my app in release mode and was able to obtain the freeze. It appears most likely to happen when scrolling lists; I'll be scrolling, navigating back and forth, scroll some more, and then freeze.

What's interesting is that the "long" usage of SynchronizedObject.lock() shows up on more threads now - worker threads, child threads and the Main thread.

The file is too big to attach here, so I will send it to you as well.

As another update, as I was scrolling through the same list that can sometimes cause the freezing, I got this Thread Performance Checker message:

Thread Performance Checker: Thread running at User-interactive quality-of-service class waiting on a lower QoS thread running at Default quality-of-service class. Investigate ways to avoid priority inversions
PID: 57512, TID: 2040891
Backtrace
=================================================================
3   DRiP Haus                           0x0000000106148f0c _ZN12SkMessageBusIN5skgpu27UniqueKeyInvalidatedMessageEjLb1EE5Inbox4pollEPN12skia_private6TArrayIS1_Lb0EEE + 148
4   DRiP Haus                           0x0000000106147818 _ZN15GrResourceCache13purgeAsNeededEv + 52
5   DRiP Haus                           0x000000010622bd9c _ZN17GrMtlRenderTargetC1EP8GrMtlGpu7SkISize5sk_spI15GrMtlAttachmentES5_NS_7WrappedENSt3__117basic_string_viewIcNS7_11char_traitsIcEEEE + 288
6   DRiP Haus                           0x000000010622c0f4 _ZN17GrMtlRenderTarget23MakeWrappedRenderTargetEP8GrMtlGpu7SkISizeiPU21objcproto10MTLTexture11objc_object + 468
7   DRiP Haus                           0x0000000106222424 _ZN8GrMtlGpu25onWrapBackendRenderTargetERK21GrBackendRenderTarget + 204
8   DRiP Haus                           0x0000000106139238 _ZN5GrGpu23wrapBackendRenderTargetERK21GrBackendRenderTarget + 156
9   DRiP Haus                           0x000000010614068c _ZN15GrProxyProvider23wrapBackendRenderTargetERK21GrBackendRenderTarget5sk_spIN5skgpu16RefCntedCallbackEE + 92
10  DRiP Haus                           0x00000001061e4a40 _ZN10SkSurfaces23WrapBackendRenderTargetEP18GrRecordingContextRK21GrBackendRenderTarget15GrSurfaceOrigin11SkColorType5sk_spI12SkColorSpaceEPK14SkSurfacePropsPFvPvESD_ + 484
11  DRiP Haus                           0x000000010653e4a0 org_jetbrains_skia_Surface__1nMakeFromBackendRenderTarget + 124
12  DRiP Haus                           0x0000000105ed55e0 kfun:org.jetbrains.skia.Surface.Companion#makeFromBackendRenderTarget(org.jetbrains.skia.DirectContext;org.jetbrains.skia.BackendRenderTarget;org.jetbrains.skia.SurfaceOrigin;org.jetbrains.skia.SurfaceColorFormat;org.jetbrains.skia.ColorSpace?;org.jetbrains.skia.SurfaceProps?){}org.jetbrains.skia.Surface? + 756
13  DRiP Haus                           0x000000010697c568 kfun:androidx.compose.ui.window.MetalRedrawer.draw#internal + 7044
14  DRiP Haus                           0x000000010697e994 kfun:androidx.compose.ui.window.MetalRedrawer.<init>$lambda$1#internal + 472
15  DRiP Haus                           0x0000000106980d8c kfun:androidx.compose.ui.window.MetalRedrawer.$<init>$lambda$1$FUNCTION_REFERENCE$1.invoke#internal + 72
16  DRiP Haus                           0x0000000106980e5c kfun:androidx.compose.ui.window.MetalRedrawer.$<init>$lambda$1$FUNCTION_REFERENCE$1.$<bridge-DNN>invoke(){}#internal + 72
17  DRiP Haus                           0x0000000105b4c710 kfun:kotlin.Function0#invoke(){}1:0-trampoline + 100
18  DRiP Haus                           0x0000000106981d34 kfun:androidx.compose.ui.window.DisplayLinkProxy.handleDisplayLinkTick#internal + 152
19  DRiP Haus                           0x0000000106981de8 kfun:androidx.compose.ui.window.DisplayLinkProxy.$imp:handleDisplayLinkTick#internal + 144
20  QuartzCore                          0x0000000189dc34ac _ZN2CA7Display15DisplayLinkItem9dispatch_ERNS_8SignPost8IntervalILNS2_11CAEventCodeE835322056EEE + 44
21  QuartzCore                          0x0000000189dc46d4 _ZN2CA7Display11DisplayLink14dispatch_itemsEyyy + 804
22  QuartzCore                          0x0000000189dc42a0 _ZN2CA7Display11DisplayLink8callbackEP15_CADisplayTimeryyybPv + 632
23  QuartzCore                          0x0000000189ecab9c _ZL22display_timer_callbackP12__CFMachPortPvlS1_ + 336
24  CoreFoundation                      0x00000001803b84e4 __CFMachPortPerform + 172
25  CoreFoundation                      0x00000001803eeddc __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 56
26  CoreFoundation                      0x00000001803ee3ac __CFRunLoopDoSource1 + 496
27  CoreFoundation                      0x00000001803e89bc __CFRunLoopRun + 2176
28  CoreFoundation                      0x00000001803e7d28 CFRunLoopRunSpecific + 572
29  GraphicsServices                    0x000000018e7cdbc0 GSEventRunModal + 160
30  UIKitCore                           0x00000001852bafdc -[UIApplication _run] + 868
31  UIKitCore                           0x00000001852bec54 UIApplicationMain + 124
32  SwiftUI                             0x00000001c4b04524 OUTLINED_FUNCTION_70 + 500
33  SwiftUI                             0x00000001c4b043c4 OUTLINED_FUNCTION_70 + 148
34  SwiftUI                             0x00000001c4816108 OUTLINED_FUNCTION_2 + 92
35  DRiP Haus                           0x00000001047a5e80 $s9DRiP_Haus6iOSAppV5$mainyyFZ + 40
36  DRiP Haus                           0x00000001047a5f30 main + 12

Thread running at User-interactive quality-of-service class waiting on a lower QoS thread running at Default quality-of-service class. Investigate ways to avoid priority inversions

We've heard that mutex implementations like our one can suffer from QoS problems (Kotlin/kotlinx-atomicfu#462), and this looks like solid proof. Thanks for the codebase! I'm sure it will be invaluable in our investigation.