Kotlin/kotlinx.coroutines

UndispatchedCoroutine leaks CoroutineContext via thread local when using custom ContinuationInterceptor

pyricau opened this issue · 4 comments

Bug description

To fix #2930 (thread locals using ThreadContextElement not cleaned up), #3252 introduce a little hack (source) in UndispatchedCoroutine which immediately saves the previous thread context to UndispatchedCoroutine.threadStateToRecover within the UndispatchedCoroutine constructor, only when the ContinuationInterceptor is not a CoroutineDispatcher:

internal actual class UndispatchedCoroutine<in T>actual constructor () : ScopeCoroutine<T>() {
    init {
        if (uCont.context[ContinuationInterceptor] !is CoroutineDispatcher) {
            val values = updateThreadContext(context, null)
            restoreThreadContext(context, values)
            saveThreadContext(context, values)
        }
    }

If the continuation doesn't suspend, then UndispatchedCoroutine.afterResume() isn't invoked and UndispatchedCoroutine.threadStateToRecover.remove() is not called.

One thing of note, the javadoc for UndispatchedCoroutine.threadStateToRecover claims this:

     * Each access to Java's [ThreadLocal] leaves a footprint in the corresponding Thread's `ThreadLocalMap`
     * that is cleared automatically as soon as the associated thread-local (-> UndispatchedCoroutine) is garbage collected.

Unfortunately, that's not exactly correct. When the ThreadLocal is garbage collected, its value actually stays in the ThreadLocalMap, until either the thread itself gets garbage collected, or the ThreadLocal map cleans up stale entries, which only happens when using thread locals some more.

So when the right conditions apply, the CoroutineContext instance ends up being held in memory long after it's no longer needed.

To reproduce this, you need:

  • A CoroutineContext that contains a ContinuationInterceptor that is not a CoroutineDispatcher (to trigger the UndispatchedCoroutine constructor hack)
  • Call withContext(), changing the CoroutineContext but without changing the ContinuationInterceptor (to trigger the withContext() fast path 2 that starts the coroutine on the same ContinuationInterceptor, wrapping it with UndispatchedCoroutine.

Here's a repro: #4295

package kotlinx.coroutines

import kotlinx.coroutines.testing.*
import org.junit.Test
import java.lang.ref.*
import kotlin.coroutines.*
import kotlin.test.*

class CustomContinuationInterceptorTest : TestBase() {

    fun `CoroutineDispatcher not suspending does not leak CoroutineContext`() =
        ensureCoroutineContextGCed(Dispatchers.Default, suspend = false)

    @Test
    fun `CoroutineDispatcher suspending does not leak CoroutineContext`() =
        ensureCoroutineContextGCed(Dispatchers.Default, suspend = true)


    @Test
    fun `CustomContinuationInterceptor suspending does not leak CoroutineContext`() =
        ensureCoroutineContextGCed(
            CustomContinuationInterceptor(Dispatchers.Default),
            suspend = true
        )

    // This is the one test that fails
    @Test
    fun `CustomContinuationInterceptor not suspending leaks CoroutineContext`() =
        ensureCoroutineContextGCed(
            CustomContinuationInterceptor(Dispatchers.Default),
            suspend = false
        )

    @Test
    fun `CustomNeverEqualContinuationInterceptor suspending does not leak CoroutineContext`() =
        ensureCoroutineContextGCed(
            CustomNeverEqualContinuationInterceptor(Dispatchers.Default),
            suspend = true
        )

    @Test
    fun `CustomNeverEqualContinuationInterceptor not suspending does not leak CoroutineContext`() =
        ensureCoroutineContextGCed(
            CustomNeverEqualContinuationInterceptor(Dispatchers.Default),
            suspend = false
        )


    private fun ensureCoroutineContextGCed(coroutineContext: CoroutineContext, suspend: Boolean) {
        runTest {
            lateinit var ref: WeakReference<CoroutineName>
            val job = GlobalScope.launch(coroutineContext) {
                val coroutineName = CoroutineName("Yo")
                ref = WeakReference(coroutineName)
                withContext(coroutineName) {
                    if (suspend) {
                        delay(1)
                    }
                }
            }
            job.join()

            System.gc()
            assertNull(ref.get())
        }
    }

}

class CustomContinuationInterceptor(private val delegate: ContinuationInterceptor) :
    AbstractCoroutineContextElement(ContinuationInterceptor), ContinuationInterceptor {

    override fun <T> interceptContinuation(continuation: Continuation<T>): Continuation<T> {
        return delegate.interceptContinuation(continuation)
    }
}

class CustomNeverEqualContinuationInterceptor(private val delegate: ContinuationInterceptor) :
    AbstractCoroutineContextElement(ContinuationInterceptor), ContinuationInterceptor {

    override fun <T> interceptContinuation(continuation: Continuation<T>): Continuation<T> {
        return delegate.interceptContinuation(continuation)
    }

    override fun equals(other: Any?) = false
}

Fix

I see a few options:

  • Implement #3253 which presumably means the hack can be removed.
  • Provide users with a way to escape the if (uCont.context[ContinuationInterceptor] !is CoroutineDispatcher) { check somehow (e.g. introduce an additional funky interface..)

Workaround

Override equals in any custom ContinuationInterceptor implementation and have it return always false, to skip the fast way that's creating the UndispatchedCoroutine creation.

How we found this

I ran into some Compose related memory leaks in Square's UI tests back in February, asked the Compose team if they knew anything about that. They didn't. Then yesterday folks from the Compose team reached out because they're enabling leak detection in their test suite and were running into the exact same leak. We worked together on analyzing a heap dump and debugging and eventually root cause it to this bug.

The Compose team is looking into whether they can start using CoroutineDispatcher implementations directly in Compose tests (for other reasons as well, e.g. see this issue)

Here's the leaktrace as reported by leakcanary:

====================================
HEAP ANALYSIS RESULT
====================================
1 APPLICATION LEAKS

References underlined with "~~~" are likely causes.
Learn more at https://squ.re/leaks.

44298 bytes retained by leaking objects
Signature: c4797f2a268bbe8b50072affe05b2f82ce29f5ed
┬───
│ GC Root: Thread object
│
├─ java.lang.Thread instance
│    Leaking: NO (the main thread always runs)
│    Thread name: 'main'
│    ↓ Thread.threadLocals
│             ~~~~~~~~~~~~
├─ java.lang.ThreadLocal$ThreadLocalMap instance
│    Leaking: UNKNOWN
│    Retaining 446.6 kB in 9221 objects
│    ↓ ThreadLocal$ThreadLocalMap.table
│                                 ~~~~~
├─ java.lang.ThreadLocal$ThreadLocalMap$Entry[] array
│    Leaking: UNKNOWN
│    Retaining 446.6 kB in 9220 objects
│    ↓ ThreadLocal$ThreadLocalMap$Entry[2]
│                                      ~~~
├─ java.lang.ThreadLocal$ThreadLocalMap$Entry instance
│    Leaking: UNKNOWN
│    Retaining 444.0 kB in 9136 objects
│    ↓ ThreadLocal$ThreadLocalMap$Entry.value
│                                       ~~~~~
├─ kotlin.Pair instance
│    Leaking: UNKNOWN
│    Retaining 444.0 kB in 9135 objects
│    ↓ Pair.first
│           ~~~~~
├─ kotlin.coroutines.CombinedContext instance
│    Leaking: UNKNOWN
│    Retaining 444.0 kB in 9134 objects
│    ↓ CombinedContext.left
│                      ~~~~
├─ kotlin.coroutines.CombinedContext instance
│    Leaking: UNKNOWN
│    Retaining 442.9 kB in 9112 objects
│    ↓ CombinedContext.left
│                      ~~~~
├─ kotlin.coroutines.CombinedContext instance
│    Leaking: UNKNOWN
│    Retaining 442.9 kB in 9111 objects
│    ↓ CombinedContext.element
│                      ~~~~~~~
├─ kotlinx.coroutines.internal.ScopeCoroutine instance
│    Leaking: UNKNOWN
│    Retaining 441.0 kB in 9040 objects
│    ↓ ScopeCoroutine.uCont
│                     ~~~~~
├─ androidx.compose.foundation.gestures.DefaultScrollableState$scroll$2 instance
│    Leaking: UNKNOWN
│    Retaining 441.0 kB in 9038 objects
│    Anonymous subclass of kotlin.coroutines.jvm.internal.SuspendLambda
│    ↓ DefaultScrollableState$scroll$2.$block
│                                      ~~~~~~
├─ androidx.compose.foundation.gestures.ScrollingLogic$scroll$2 instance
│    Leaking: UNKNOWN
│    Retaining 104 B in 2 objects
│    Anonymous subclass of kotlin.coroutines.jvm.internal.SuspendLambda
│    ↓ ScrollingLogic$scroll$2.this$0
│                              ~~~~~~
├─ androidx.compose.foundation.gestures.ScrollingLogic instance
│    Leaking: UNKNOWN
│    Retaining 796 B in 33 objects
│    ↓ ScrollingLogic.overscrollEffect
│                     ~~~~~~~~~~~~~~~~
├─ androidx.compose.foundation.AndroidEdgeEffectOverscrollEffect instance
│    Leaking: UNKNOWN
│    Retaining 484 B in 12 objects
│    ↓ AndroidEdgeEffectOverscrollEffect.edgeEffectWrapper
│                                        ~~~~~~~~~~~~~~~~~
├─ androidx.compose.foundation.EdgeEffectWrapper instance
│    Leaking: UNKNOWN
│    Retaining 56 B in 1 objects
│    context instance of androidx.activity.ComponentActivity with mDestroyed = true
│    ↓ EdgeEffectWrapper.context
│                        ~~~~~~~
╰→ androidx.activity.ComponentActivity instance
     Leaking: YES (ObjectWatcher was watching this because androidx.activity.ComponentActivity received Activity#onDestroy() callback and Activity#mDestroyed is true)
     Retaining 44.3 kB in 797 objects
     key = 6bdc725d-e124-437c-895d-437bf03213af
     watchDurationMillis = 5182
     retainedDurationMillis = 174
     mApplication instance of android.app.Application
     mBase instance of android.app.ContextImpl

For anyone with access to the ASG slack, here's the investigation thread: https://androidstudygroup.slack.com/archives/CJH03QASH/p1733875296139909?thread_ts=1708204699.328019&cid=CJH03QASH

Thanks for a really comprehensive and easy-to-follow bug report!

The issue is clear, and the potential solution is easy to implement (though hard to accept :)) and is hidden in the problem statement: If the continuation doesn't suspend, then UndispatchedCoroutine.afterResume() isn't invoked.

The undispatched handling path should just be aware of that -- either by having an explicit cleanup function in ScopedCoroutine` or by if'ing the actual implementation

I have a prototype that passes your tests (again, thanks!), I'll polish it up tomorrow morning and open a PR. It should make it to 1.10 coroutines with Kotlin 2.1

Thank you for fixing!!

Could I request a backport for 1.8.x, we'd like to pick up this fix in compose without forcing our clients to kotlin 2.0?