OpenHFT/Chronicle-Threads

EventGroup sometimes times out on start

Closed this issue · 5 comments

EventGroup#start blocks until the core event loop has started. There have been multiple reports of performStart timing out and throwing an exception. This is unexpected and hasn't been reproduced reliably.

One client reported setting the timeout to 60s and still seeing the timeout, so probably not a case of the timeout being too short.

while (!isAlive()) {
try {
timeoutPauser.pause(WAIT_TO_START_MS, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
Jvm.error().on(EventGroup.class, format("Timed out waiting for start!%n" +
"%s%n%n" +
"%s%n%n",
EventLoopStateRenderer.INSTANCE.render("Core", core),
EventLoopStateRenderer.INSTANCE.render("Monitor", monitor)));
throw Jvm.rethrow(e);
}
}
}
@Override
protected void performStopFromNew() {
performStop();
}

This appears to be class-loader related and probably due to a dependency between static initialisers that can arise. The stack-trace of the blocked event loop thread is as follows:

{"time":"20221109-23:26:50.921", "level":"WARN", "log":"Thread dump", "exceptionCause":"Stack trace main/core-event-loop", "threadName":"main/core-event-loop", "threadId":"30", "threadState":"RUNNABLE", "waitCount":"0", "waitTime":"-1", "blockedCount":"0", "blockedTime":"-1", "lockName":"null", "lockO
com.company.ThreadDumpStackTraceWrapper: Stack trace main/core-event-loop
        at java.base@11.0.3/java.lang.ClassLoader.defineClass1(Native Method)
        at java.base@11.0.3/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
        at java.base@11.0.3/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
        at java.base@11.0.3/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:802)
        at java.base@11.0.3/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:700)
        at java.base@11.0.3/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:623)
        at java.base@11.0.3/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base@11.0.3/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base@11.0.3/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at app//com.sun.jna.Native.initIDs(Native Method)
        at app//com.sun.jna.Native.<clinit>(Native.java:222)
        at app//com.sun.jna.Structure.setAlignType(Structure.java:280)
        at app//com.sun.jna.Structure.<init>(Structure.java:197)
        at app//com.sun.jna.Structure.<init>(Structure.java:193)
        at app//com.sun.jna.Structure.<init>(Structure.java:180)
        at app//com.sun.jna.Structure.<init>(Structure.java:172)
        at app//net.openhft.affinity.impl.LinuxHelper$utsname.<init>(LinuxHelper.java:183)
        at app//net.openhft.affinity.impl.LinuxHelper.<clinit>(LinuxHelper.java:37)
        at app//net.openhft.affinity.impl.LinuxJNAAffinity.<clinit>(LinuxJNAAffinity.java:42)
        at app//net.openhft.affinity.Affinity.isLinuxJNAAffinityUsable(Affinity.java:115)
        at app//net.openhft.affinity.Affinity.<clinit>(Affinity.java:54)
        at app//net.openhft.chronicle.core.threads.CleaningThread.run(CleaningThread.java:150)

I've seen this happen again a few times. I think perhaps we should do a JVM-wide thread dump when this happens to help troubleshoot it. It has been caused by a deadlock in static initialisers in the past, which was resolved by eagerly initialising some things.