`CallbackStack#pack` can double-count removals when called concurrently
armanbilge opened this issue · 5 comments
h/t @mernst-github in #3935 (comment). I've captured this as a test case in 6e2e87d, which may non-deterministically fail with:
[error] x handle race conditions in pack
[error] 3 != 2 (CallbackStackSpec.scala:49)
Probably worth carrying over from #3935 that this corrupts the `IODeferred` `clearCounter` and leads to unreliable `pack` invocations, effectively a memory leak.
I don't have deep insights into the design of the callback stack, but on first glance I would try to avoid concurrent `pack`s altogether (code bails anyway when it detects one); I find it hard to reason about its safety. Guarding `pack` at the stack root with an atomic (without atomics for the actual linked stack) sounds more robust to me.
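To make the suggestion concrete, here is a minimal Scala sketch, assuming a simplified Treiber-style stack in which clearing a callback only nulls it out and a later `pack` unlinks cleared nodes. `Node`, `GuardedStack`, `push`, `clear`, and `length` are hypothetical names; this is not the actual Cats Effect `CallbackStack`. An `AtomicBoolean` at the root lets only one `pack` run at a time, while the nodes' `next` fields stay plain `@volatile` fields, since past the head only the packer ever writes them.

```scala
import java.util.concurrent.atomic.{AtomicBoolean, AtomicReference}

// Hypothetical node: a null callback means "cleared, waiting to be unlinked".
final class Node[A](cb: A => Unit) {
  @volatile var callback: A => Unit = cb
  @volatile var next: Node[A] = null
}

// Hypothetical simplified callback stack (not the Cats Effect CallbackStack).
final class GuardedStack[A] {
  private val head = new AtomicReference[Node[A]](null)
  // Root-level guard: at most one pack() traverses and unlinks at a time.
  private val packing = new AtomicBoolean(false)

  // Treiber-style push: link the new node to the current head, then CAS it in.
  def push(cb: A => Unit): Node[A] = {
    val node = new Node(cb)
    var done = false
    while (!done) {
      val h = head.get()
      node.next = h
      done = head.compareAndSet(h, node)
    }
    node
  }

  // Clearing is only logical; the node stays linked until pack() removes it.
  def clear(node: Node[A]): Unit = node.callback = null

  /** Unlinks cleared nodes and returns how many it removed. Returns 0 without
    * traversing if another pack() already holds the guard. */
  def pack(): Int = {
    if (!packing.compareAndSet(false, true)) 0
    else
      try {
        var removed = 0
        var prev = head.get()
        // push() races on the head, so cleared head nodes are removed by CAS.
        // If the CAS fails, a push won: that node is left for a later pack()
        // and the interior loop continues below it.
        while (prev != null && prev.callback == null && head.compareAndSet(prev, prev.next)) {
          removed += 1
          prev = prev.next
        }
        // Past the head only the packer writes `next`, so plain writes are safe
        // while the guard is held.
        if (prev != null) {
          var cur = prev.next
          while (cur != null) {
            if (cur.callback == null) { prev.next = cur.next; removed += 1 }
            else prev = cur
            cur = cur.next
          }
        }
        removed
      } finally packing.set(false)
  }

  // Number of nodes still linked (cleared or not).
  def length: Int = {
    var n = 0
    var cur = head.get()
    while (cur != null) { n += 1; cur = cur.next }
    n
  }
}
```

Because a `pack` that loses the race for the guard returns 0 instead of traversing, each unlinked node is counted exactly once, so a counter fed by the return value cannot be inflated by overlapping calls.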
> but on first glance I would try to avoid concurrent `pack`s altogether (code bails anyway when it detects one) ... Guarding `pack` at the stack root with an atomic (without atomics for the actual linked stack) sounds more robust to me.
Thanks, this is a really interesting idea!! @samspills and I are giving it a try :)
After some reading, it turns out that it is "well-known" (LOL) that removing a node from a concurrent linked list with a single CAS is incorrect; we just didn't notice it :-). There is some reading about it here and in the gigantic comment in OpenJDK's `ConcurrentSkipListMap`. The short version is that fixing it requires marking the element's `next` pointer with a CAS before unlinking the element. (Fun fact: our `TimerSkipList` already does this, as it is a port of (some of) CSLM.)
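The marking technique can be sketched with the JDK's `AtomicMarkableReference`, which stores a reference and a boolean mark that are compared-and-set together. This is a hypothetical Harris-style list (`MNode` and `MarkedList` are illustrative names), not the code of `CallbackStack`, `TimerSkipList`, or `ConcurrentSkipListMap` (the OpenJDK class gets the same effect with marker nodes instead of marked references). Removal first marks the victim's own `next` pointer, which freezes it, and only then CASes the predecessor past it; only the thread whose mark succeeds reports the removal.

```scala
import java.util.concurrent.atomic.AtomicMarkableReference

// Hypothetical node: the mark on `next` means "this node is logically deleted".
final class MNode[A](val value: A, next0: MNode[A]) {
  val next = new AtomicMarkableReference[MNode[A]](next0, false)
}

// Hypothetical Harris-style list with pushes at the front and removal anywhere.
final class MarkedList[A] {
  // The sentinel is never removed, so its `next` is never marked.
  private val sentinel = new MNode[A](null.asInstanceOf[A], null)

  def push(a: A): Unit = {
    var done = false
    while (!done) {
      val first = sentinel.next.getReference
      done = sentinel.next.compareAndSet(first, new MNode(a, first), false, false)
    }
  }

  // Finds an unmarked node holding `a` and its predecessor, unlinking any
  // marked nodes it passes (helping removals that have not finished step 2).
  private def find(a: A): Option[(MNode[A], MNode[A])] = {
    var pred = sentinel
    var cur = pred.next.getReference
    while (cur != null) {
      val markHolder = Array(false)
      val succ = cur.next.get(markHolder)
      if (markHolder(0)) {
        // cur is logically deleted: unlink it, or restart if pred changed under us.
        if (pred.next.compareAndSet(cur, succ, false, false)) cur = succ
        else { pred = sentinel; cur = pred.next.getReference }
      } else if (cur.value == a) return Some((pred, cur))
      else { pred = cur; cur = succ }
    }
    None
  }

  /** Returns true only for the thread whose mark CAS succeeded. */
  def remove(a: A): Boolean = {
    while (true) {
      find(a) match {
        case None => return false
        case Some((pred, cur)) =>
          val succ = cur.next.getReference
          // Step 1: logical deletion. Marking freezes cur.next, so no other
          // thread can unlink cur's successor through cur.
          if (cur.next.compareAndSet(succ, succ, false, true)) {
            // Step 2: physical unlink. If this CAS fails, a later find() does it.
            pred.next.compareAndSet(cur, succ, false, false)
            return true
          }
        // Mark CAS failed (cur was marked or its successor changed): retry.
      }
    }
    false // unreachable; the loop only exits via return
  }

  def toList: List[A] = {
    val b = List.newBuilder[A]
    var cur = sentinel.next.getReference
    while (cur != null) {
      if (!cur.next.isMarked) b += cur.value
      cur = cur.next.getReference
    }
    b.result()
  }
}
```

Without step 1, two threads removing adjacent nodes `B` and `C` from `A -> B -> C -> D` could both succeed with single CASes (`A.next = C` and `B.next = D`), leaving `C` reachable even though it was reported removed.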
We can do the marking for `CallbackStack` if we need to. But doing something simpler/faster, by exploiting the fact that `CallbackStack` is not a general-purpose linked list, is probably a good idea (like what you discussed).
This is gone in 3.5.3, thanks!
Thanks for all your help!