Model checking deadlock
durban opened this issue · 5 comments
I think I found a bug which I can't minimize, but can always reproduce. To reproduce it, follow these steps (git and sbt are needed):
git clone https://github.com/durban/choam.git
cd choam
git checkout 998d514d
sbt "stressLinchk/testOnly dev.tauri.choam.data.TtrieModelTest"
Unfortunately there are a lot of things in that repo, and I haven't succeeded in reproducing the issue with a small test. The test in question is here: https://github.com/durban/choam/blob/998d514d88d4296db8f616a8ff68844c9874ca16/stress/stress-linchk/src/test/scala/dev/tauri/choam/data/TtrieModelTest.scala.
With this test, I get the following output:
==> X dev.tauri.choam.data.TtrieModelTest.Model checking test of Ttrie 64.609s org.jetbrains.kotlinx.lincheck.LincheckAssertionError:
= The execution has hung, see the thread dump =
| ------------------------------------ |
| Thread 1 | Thread 2 |
| ------------------------------------ |
| insert(s, qVm) | |
| ------------------------------------ |
| insert(eB, Fkaa) | insert(hQ, tQmWT) |
| ------------------------------------ |
Thread-0:
jdk.internal.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
java.lang.Thread.run(Thread.java:840)
Thread-1:
dev.tauri.choam.internal.mcas.emcas.EmcasThreadContextBase.recordCommitPlain(EmcasThreadContextBase.java:71)
dev.tauri.choam.internal.mcas.emcas.EmcasThreadContext.recordCommit(EmcasThreadContext.scala:150)
dev.tauri.choam.core.Rxn$InterpreterState.loop(Rxn.scala:1269)
dev.tauri.choam.core.Rxn$InterpreterState.interpretWithContext(Rxn.scala:1550)
dev.tauri.choam.core.Rxn$InterpreterState.interpret(Rxn.scala:1543)
dev.tauri.choam.core.Rxn.unsafePerform(Rxn.scala:260)
dev.tauri.choam.data.TtrieModelTest$TestState.insert(TtrieModelTest.scala:60)
java.lang.Thread.run(Thread.java:840)
at org.jetbrains.kotlinx.lincheck.LinChecker.check(LinChecker.kt:38)
at org.jetbrains.kotlinx.lincheck.LinChecker$Companion.check(LinChecker.kt:144)
at org.jetbrains.kotlinx.lincheck.LinChecker.check(LinChecker.kt)
at dev.tauri.choam.data.TtrieModelTest.$anonfun$new$1(TtrieModelTest.scala:41)
at dev.tauri.choam.BaseLinchkSpec.$anonfun$test$2(BaseLinchkSpec.scala:41)
With some debugging, I could confirm that the test times out here: https://github.com/JetBrains/lincheck/blob/lincheck-2.23/src/jvm/main/org/jetbrains/kotlinx/lincheck/runner/ParallelThreadsRunner.kt#L288. However, increasing the timeout doesn't solve the issue. The two threads in question just seem to wait indefinitely: one of them (Thread-0 above) is blocked in LockSupport.park, the other (Thread-1 above) is spinning in ManagedStrategy.awaitTurn (ManagedStrategy.kt:406). The latter is not visible above, but I've attached a full thread dump (threaddump.txt); here is the relevant part:
"FixedActiveThreadsExecutor@4577519-1" #25 prio=5 os_prio=0 cpu=8093.98ms elapsed=9.96s tid=0x00007f37d4523410 nid=0x3e45 runnable [0x00007f38128f6000]
java.lang.Thread.State: RUNNABLE
at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.awaitTurn(ManagedStrategy.kt:406)
at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.switchCurrentThread(ManagedStrategy.kt:416)
at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.switchCurrentThread$default(ManagedStrategy.kt:413)
at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPointRegular(ManagedStrategy.kt:325)
at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPoint(ManagedStrategy.kt:303)
at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.beforeSharedVariableRead$lincheck(ManagedStrategy.kt:489)
at dev.tauri.choam.internal.mcas.emcas.GlobalContextBase.getCommitTs(GlobalContextBase.java:43)
at dev.tauri.choam.internal.mcas.emcas.Emcas.retrieveFreshTs(Emcas.scala:676)
at dev.tauri.choam.internal.mcas.emcas.Emcas.MCAS(Emcas.scala:611)
at dev.tauri.choam.internal.mcas.emcas.Emcas.tryPerformDebug(Emcas.scala:710)
at dev.tauri.choam.internal.mcas.emcas.Emcas.tryPerformInternal(Emcas.scala:704)
at dev.tauri.choam.internal.mcas.emcas.EmcasThreadContext.tryPerformInternal(EmcasThreadContext.scala:119)
at dev.tauri.choam.internal.mcas.Mcas$ThreadContext.tryPerform(Mcas.scala:169)
at dev.tauri.choam.internal.mcas.Mcas$ThreadContext.tryPerform$(Mcas.scala:158)
at dev.tauri.choam.internal.mcas.emcas.EmcasThreadContext.tryPerform(EmcasThreadContext.scala:26)
at dev.tauri.choam.core.Rxn$InterpreterState.casLoop(Rxn.scala:1237)
at dev.tauri.choam.core.Rxn$InterpreterState.loop(Rxn.scala:1265)
at dev.tauri.choam.core.Rxn$InterpreterState.interpretWithContext(Rxn.scala:1550)
at dev.tauri.choam.core.Rxn$InterpreterState.interpret(Rxn.scala:1543)
at dev.tauri.choam.core.Rxn.unsafePerform(Rxn.scala:260)
at dev.tauri.choam.data.TtrieModelTest$TestState.insert(TtrieModelTest.scala:60)
at org.jetbrains.kotlinx.lincheck.runner.TestThreadExecution9.run(Unknown Source)
at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.testThreadRunnable$lambda-8(FixedActiveThreadsExecutor.kt:151)
at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor$$Lambda$142/0x00007f381829d360.run(Unknown Source)
at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
Locked ownable synchronizers:
- None
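To make the two wait modes in the dump concrete, here is a minimal Java sketch (not Lincheck's actual code) of the two primitives involved: Thread-0 is blocked in LockSupport.park, which only returns once another thread calls unpark on it, while Thread-1 busy-waits on a "whose turn is it" variable, similar in spirit to ManagedStrategy.awaitTurn. The names `currentThread`, `t0`, and `t1` are illustrative only. If neither the unpark nor the turn hand-off ever happens, both threads wait forever, which is the shape of the hang reported here:

```java
import java.util.concurrent.locks.LockSupport;

public class ParkVsSpinDemo {
    // Illustrative stand-in for a scheduler's "current thread" turn variable.
    static volatile int currentThread = 1;

    public static void main(String[] args) throws InterruptedException {
        // Like Thread-0 in the dump: parks until some other thread unparks it.
        Thread t0 = new Thread(() -> {
            LockSupport.park();
            System.out.println("t0: unparked");
        });

        // Like Thread-1 in the dump: spins until the turn variable points at it.
        Thread t1 = new Thread(() -> {
            while (currentThread != 2) {
                Thread.onSpinWait();
            }
            System.out.println("t1: got the turn");
        });

        t0.start();
        t1.start();

        // If the two lines below never execute, both threads hang indefinitely:
        // the parked thread has no one to unpark it, and the spinner's
        // condition never becomes true.
        currentThread = 2;      // releases the spinning thread
        LockSupport.unpark(t0); // releases the parked thread

        t0.join();
        t1.join();
        System.out.println("done");
    }
}
```

The key asymmetry, also visible in the thread dump, is that the parked thread shows up as WAITING while the spinning thread shows up as RUNNABLE and burns CPU (note the `cpu=8093.98ms` in the dump above).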
This might be a red herring, but looking at a heap dump, I see an org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor with its results array having an IllegalStateException (with message "Trying to switch the execution to thread 0, but only the following threads are eligible to switch: [1]") at index 0. Maybe this exception is thrown (probably here), and somehow this makes the two threads lock up? Although I'm not that familiar with the code...
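If that hypothesis is right, the general failure pattern would be an exception that gets recorded in a results slot without the completion signal ever being delivered, so the waiting side just blocks. The following is a self-contained Java sketch of that pattern under those assumptions; it uses a plain array and a CountDownLatch, not Lincheck's actual executor machinery, and the latch wait is bounded so the demo terminates instead of hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LostExceptionDemo {
    public static void main(String[] args) throws InterruptedException {
        Object[] results = new Object[1];          // per-thread result slots
        CountDownLatch done = new CountDownLatch(1); // completion signal

        Thread worker = new Thread(() -> {
            try {
                throw new IllegalStateException("unexpected switch request");
            } catch (IllegalStateException e) {
                // The failure is recorded in the slot, but `done` is never
                // counted down, so the waiter is never woken up.
                results[0] = e;
            }
        });
        worker.start();
        worker.join();

        // The waiting side never learns the task finished abnormally; with an
        // unbounded await() it would block forever. Here we time out instead.
        boolean finished = done.await(100, TimeUnit.MILLISECONDS);
        System.out.println("finished=" + finished
            + ", result=" + results[0].getClass().getSimpleName());
    }
}
```

In the real scenario the waiter would presumably be parked without any timeout, which would match the parked Thread-0 seen in the thread dump.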
Hi @durban!
Thank you for the bug report and the detailed investigation.
I tried to reproduce the bug on the develop branch of Lincheck, but it looks like the bug is already fixed on that branch. I did some bisecting and found that the following commit fixes the bug (this line, to be precise).
We will prepare a new release encompassing the latest changes from the develop branch, including the commit that fixes the bug you reported.
CC @ndkoval
Sounds good, thank you!
Hi @durban !
Could you please re-check if the 2.24 release fixes the issue for you?
Yes, 2.24 seems to work fine. Thanks!