JetBrains/lincheck

Hang in FixedActiveThreadsExecutor (Lincheck 2.19)

Closed this issue · 17 comments

When running on 2.18.1 a github action takes 8 minutes to run the tests. When upgrading, it is still running after 58 minutes. When running locally the tests complete without an error. I'll revert that upgrade to isolate it, but the other changes were benign code readability changes to please a static analyzer. So for now this is an fyi as I try to reproduce locally.

It canceled after a 60 minute timeout trying to verify an embedded copy of JCTools' MpscGrowableArrayQueue.

Gradle suite > Gradle test > com.github.benmanes.caffeine.cache.MpscGrowableArrayQueueLincheckTest > modelCheckingTest STARTED

Gradle suite > Gradle test > com.github.benmanes.caffeine.cache.MpscGrowableArrayQueueLincheckTest > stressTest STARTED

It ran fine when downgrading to 2.18.1. Since it appears to hang on 2.19, I probably need to ssh into it and run jstack or capture a jfr profile.

"Test worker" #1 prio=5 os_prio=0 cpu=99787.06ms elapsed=202.38s tid=0x00007f8a7002c800 nid=0x8ea waiting on condition  [0x00007f8a74f14000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.19/Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.19/LockSupport.java:357)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.getResult(FixedActiveThreadsExecutor.kt:112)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.awaitTask(FixedActiveThreadsExecutor.kt:93)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.await(FixedActiveThreadsExecutor.kt:89)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.submitAndAwait(FixedActiveThreadsExecutor.kt:67)
        at org.jetbrains.kotlinx.lincheck.runner.ParallelThreadsRunner.run(ParallelThreadsRunner.kt:246)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.runInvocation(ManagedStrategy.kt:226)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.modelchecking.ModelCheckingStrategy.runImpl(ModelCheckingStrategy.kt:62)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.run(ManagedStrategy.kt:121)
        at org.jetbrains.kotlinx.lincheck.LinChecker.run(LinChecker.kt:142)
        at org.jetbrains.kotlinx.lincheck.LinChecker.checkImpl(LinChecker.kt:75)
        at org.jetbrains.kotlinx.lincheck.LinChecker.checkImpl$lincheck(LinChecker.kt:47)
        at org.jetbrains.kotlinx.lincheck.LinChecker.check(LinChecker.kt:37)
        at com.github.benmanes.caffeine.cache.MpscGrowableArrayQueueLincheckTest.modelCheckingTest(MpscGrowableArrayQueueLincheckTest.java:69)

I believe in #130 (comment) I found that the framework swallows all throwables and retries, so that could mean this is timing out but never completing.

Not much insight from the jfr profile: 1.jfr.zip

Hi @ben-manes, could you please fix whether the issue is resolved in 2.20?

It's not looking great so far at 30 minutes (job) (vs 8 minutes).

I guess Lincheck 2.19+ works badly when the number of cores exceeds the number of threads. Is it the case here?

I'd have thought the opposite? It runs fine on my old macbook (8 ht cores) but fails on github actions (2 core). This hung on the model checker where I did not override the threads option, which looks to be 2.

@ben-manes, could you please check whether the fix #204 in Lincheck 2.21 solves the issue?

kicked off on x86 and arm64

no luck, 45min in so should timeout soon

I tried adjusting Runtime.getRuntime().availableProcessors() to a synthetic value by using -XX:ActiveProcessorCount, hoping to reproduce it locally or fix it remotely. Unfortunately that didn't work. I had thought maybe if it was core vs thread counts that might trick it, but it may be something else.

@ben-manes, is it possible to get a thread dump?

The jfr recording linked above includes a thread dump. I can try to capture a new one on the latest version if that helps, but will may have to wait until the weekend. I believe that I used action-upterm to ssh into the action before it ran the lincheckTest suite.

If you wanted to try to iterate on a bug fix, the approach that I would take is to make a reproducer repository that is trimmed to just the work needed, switch the lincheck dependency to jitpack and use a commit hash version, ssh as needed, and iterate until the cause is figured out.

I trimmed the build, added upterm, and ran the following on the host. The jfr and stacktrace files are attached. The profile recording has per-minute thread dumps and additional information.

touch continue

export LINCHECK=$(jps | grep GradleWorkerMain | cut -f1 -d' ')
echo $LINCHECK

jcmd $LINCHECK JFR.start duration=15m filename=/home/runner/work/caffeine/caffeine/lincheck.jfr settings=profile.jfc
sleep 960 && echo -ne '\007'
jcmd $LINCHECK JFR.check

zip lincheck lincheck.jfr
curl -s -F "file=@lincheck.zip" https://file.io | jq -r '.link'

jstack $LINCHECK > stacktrace.txt
curl -s -F "file=@stacktrace.txt" https://file.io | jq -r '.link'

lincheck.zip
stacktrace.txt

still no luck as of 2.23 (build)

Anything more that you need? I’m fine pinning to the old version but it’s a bit concerning of a regression and might confuse new users.

Hi, @ben-manes! Sorry for the very long reply. Could you please check whether the issue still reproduces in Lincheck 2.25?

works for me!