random crash of Java application with Datadog Monitoring active
Describe the bug
The Java process crashes with the following error message:
A fatal error has been detected by the Java Runtime Environment:
Internal Error (codeCache.cpp:654), pid=22359, tid=958
guarantee(is_result_safe || is_in_asgct()) failed: unsafe access to zombie method
JRE version: OpenJDK Runtime Environment Corretto-17.0.12.7.1 (17.0.12+7) (build 17.0.12+7-LTS)
Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.12.7.1 (17.0.12+7-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
Problematic frame:
V [libjvm.so+0x5b9d21] CodeCache::find_blob(void*)+0xc1
Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e" (or dumping to /apps/bfr/calypso/p/cal01/6.200.00100.1009/client/core.22359)
JFR recording file will be written. Location: /apps/bfr/calypso/p/cal01/6.200.00100.1009/client/hs_err_pid22359.jfr
An error report file with more information is saved as:
/apps/bfr/calypso/p/cal01/log/hs_err_dataServer_pid22359.log
If you would like to submit a bug report, please visit:
https://github.com/corretto/corretto-17/issues/
/apps/bfr/calypso/p/cal01/current/deploy-local/p1/dataServer.sh: line 247: 22359 Aborted (core dumped) bash /apps/bfr/calypso/p/cal01/6.200.00100.1009/client/bin/calypso ${SSL_JVM_ARGS} com.calypso.apps.startup.StartCalypsoServer ${SPRING_CONFIG} -env p1 -log ${LOG_DIR_APP_ARGS} ${CALYPSO_APP_ARGS} ${GATEWAY_URL_APP_ARG} ${SERVER_PORT_ARG} ${OPENID_APP_ARGS} ${DEFAULT_SERVELET} $*
To Reproduce
The crash seems to occur at random, roughly once every few months. Different components of our distributed system are affected.
Expected behavior
No crash
Platform information
uname -srm
Linux 4.18.0-553.16.1.el8_10.x86_64 x86_64
java -version
openjdk version "17.0.12" 2024-07-16 LTS
OpenJDK Runtime Environment Corretto-17.0.12.7.1 (build 17.0.12+7-LTS)
OpenJDK 64-Bit Server VM Corretto-17.0.12.7.1 (build 17.0.12+7-LTS, mixed mode, sharing)
Additional context
We use Datadog (dd-agent.jar) to monitor the application and suspect that the crash is related to it. To reduce operational risk, we disabled Datadog profiling in production and kept it enabled on our test instances. We have not tested whether this is Corretto-specific or a general OpenJDK problem: the issue occurs so infrequently that we have no good way of testing that.
hs_err_pid22359.zip
hs_err_dataServer_pid22359.zip
Datadog support pointed us to this JDK bug: https://bugs.openjdk.org/browse/JDK-8329103. The fix should also be included in the next Corretto release, 17.0.13.11.1. We hope this resolves the crash, but confirmation might take a while because we have no test case that reproduces it.
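
Since the crash cannot be reproduced on demand, it may help to at least record which instances already run a JDK at or above the release named above. The following is a minimal sketch of such a check, not a verified test for the fix: the class name `JdkFixCheck`, the helper `isAtLeast`, and the threshold `17.0.13` are assumptions chosen for illustration, and the comparison says nothing about whether other release lines contain the fix.

```java
// Hypothetical helper: compares a JDK version string against a minimum version.
final class JdkFixCheck {

    // Assumption: first 17.x update expected to carry the fix, taken from the comment above.
    static final String ASSUMED_FIXED_IN = "17.0.13";

    // Returns true if 'current' is at least 'minimum'. Optional build
    // information such as the "-LTS" suffix is ignored in the comparison.
    static boolean isAtLeast(String current, String minimum) {
        Runtime.Version cur = Runtime.Version.parse(current);
        Runtime.Version min = Runtime.Version.parse(minimum);
        return cur.compareToIgnoreOptional(min) >= 0;
    }

    public static void main(String[] args) {
        // Version of the JVM that runs this check, e.g. "17.0.12+7-LTS".
        String running = Runtime.version().toString();
        System.out.println("Running JDK: " + running);
        System.out.println("At or above " + ASSUMED_FIXED_IN + ": "
                + isAtLeast(running, ASSUMED_FIXED_IN));
    }
}
```

Running this on each host (or logging the same comparison at application startup) gives a simple record of which instances have been upgraded, so that a later crash can be matched against the JDK version that was in use.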