typelevel/cats-effect

`CallbackStack#pack` can double-count removals when called concurrently

armanbilge opened this issue · 5 comments

h/t @mernst-github in #3935 (comment). I've captured this as a test case in 6e2e87d, which may non-deterministically fail with:

[error]   x handle race conditions in pack
[error]    3 != 2 (CallbackStackSpec.scala:49)

Probably worth carrying over from #3935 that this corrupts the IODeferred clearCounter and leads to unreliable pack invocations, effectively a memory leak.
I don't have deep insight into the design of the callback stack, but on first glance I would try to avoid concurrent packs altogether (code bails anyway when it detects one); I find it hard to reason about their safety. Guarding pack at the stack root with an atomic (without atomics for the actual linked stack) sounds more robust to me.
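To make the suggestion concrete, here is a minimal sketch in Java of guarding `pack` with a single atomic flag at the root. It is not the cats-effect implementation: `GuardedStack`, `Node`, `push`, `clear` and `pack` are hypothetical names, and the head is kept in an `AtomicReference` only so that pushes can race with a pack in this toy version.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for a callback stack (illustration only).
final class GuardedStack {
    static final class Node {
        volatile Runnable callback; // null means "cleared"
        volatile Node next;
        Node(Runnable callback, Node next) { this.callback = callback; this.next = next; }
        void clear() { callback = null; }
    }

    private final AtomicReference<Node> head = new AtomicReference<>();
    // The guard: at most one thread may traverse and unlink at a time.
    private final AtomicBoolean packing = new AtomicBoolean(false);

    Node push(Runnable cb) {
        while (true) {
            Node h = head.get();
            Node n = new Node(cb, h);
            if (head.compareAndSet(h, n)) return n;
        }
    }

    // Unlinks cleared nodes and returns how many were removed.
    // Returns 0 immediately if another pack is already in progress.
    int pack() {
        if (!packing.compareAndSet(false, true)) return 0;
        try {
            int removed = 0;
            Node prev = null;
            Node cur = head.get();
            while (cur != null) {
                Node next = cur.next;
                if (cur.callback == null) {
                    if (prev != null) {
                        prev.next = next; // only the guard holder rewrites links
                        removed++;
                    } else if (head.compareAndSet(cur, next)) {
                        removed++;
                    } else {
                        prev = cur; // a push won the race; leave this node for a later pack
                    }
                } else {
                    prev = cur;
                }
                cur = next;
            }
            return removed;
        } finally {
            packing.set(false);
        }
    }
}
```

Because only the thread holding the flag ever unlinks nodes, each cleared node is counted by at most one `pack` call, which is the property the failing test checks.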

> but on first glance I would try to avoid concurrent packs altogether (code bails anyway when it detects one) ... Guarding pack at the stack root with an atomic (without atomics for the actual linked stack) sounds more robust to me.

Thanks, this is a really interesting idea!! @samspills and I are giving it a try :)

durban commented

After some reading, it turns out that it is "well-known" (LOL) that removing a node from a linked list with a single CAS is incorrect; we just didn't notice it :-). There is some reading about it here and in the gigantic comment in the OpenJDK ConcurrentSkipListMap. The short version is that fixing it requires marking the node's next pointer with a CAS before unlinking the node. (Fun fact: our TimerSkipList already does this, as it is a port of (some of) CSLM.)
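As a rough illustration of the mark-then-unlink idea, here is a minimal Java sketch built on `java.util.concurrent.atomic.AtomicMarkableReference`. It is not the `CallbackStack` or `TimerSkipList` code: `MarkedList`, `Node`, `push`, `remove` and `size` are hypothetical names, and the list uses a sentinel head to keep the sketch short.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Hypothetical singly linked list with two-step removal (illustration only).
final class MarkedList<T> {
    static final class Node<T> {
        final T value;
        // The mark bit on `next` means "this node is logically deleted".
        final AtomicMarkableReference<Node<T>> next;
        Node(T value, Node<T> next) {
            this.value = value;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    private final Node<T> head = new Node<>(null, null); // sentinel, never removed

    void push(T value) {
        while (true) {
            Node<T> first = head.next.getReference();
            if (head.next.compareAndSet(first, new Node<>(value, first), false, false)) return;
        }
    }

    // Returns true for exactly one caller per removed node.
    boolean remove(T value) {
        while (true) {
            boolean retry = false;
            Node<T> pred = head;
            Node<T> cur = pred.next.getReference();
            while (cur != null) {
                boolean[] marked = {false};
                Node<T> succ = cur.next.get(marked);
                if (marked[0]) {
                    // Someone else deleted cur logically: help unlink it, do not count it.
                    if (!pred.next.compareAndSet(cur, succ, false, false)) { retry = true; break; }
                    cur = succ;
                    continue;
                }
                if (cur.value.equals(value)) {
                    // Step 1: mark cur.next with a CAS; this is the linearization point.
                    if (!cur.next.compareAndSet(succ, succ, false, true)) { retry = true; break; }
                    // Step 2: best-effort physical unlink; later traversals help if it fails.
                    pred.next.compareAndSet(cur, succ, false, false);
                    return true;
                }
                pred = cur;
                cur = succ;
            }
            if (!retry) return false;
        }
    }

    // Counts nodes that are not logically deleted.
    int size() {
        int n = 0;
        for (Node<T> cur = head.next.getReference(); cur != null; cur = cur.next.getReference()) {
            if (!cur.next.isMarked()) n++;
        }
        return n;
    }
}
```

The marking CAS can succeed for only one thread per node, so two racing removers cannot both report the same removal. The second CAS merely tidies the links, and it fails harmlessly if the predecessor has itself been marked in the meantime.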

We can do the marking for CallbackStack if we need to. But doing something simpler/faster by using the fact that the CallbackStack is not a general purpose linked list is probably a good idea (like what you discussed).

This is gone in 3.5.3, thanks!

Thanks for all your help!