Memory leak, C2 compiler
shalseth opened this issue
Please provide a brief summary of the bug
As of Temurin 21.0.3 we are experiencing excessive memory usage (and oom-killer kills) from our Java Tomcat containers.
This is off-heap memory usage: with Xmx=1G we can easily trigger 10 GB of real memory usage (memstat:total_rss). From our testing so far, memory usage grows as high as whatever we configure as the Docker container memory limit.
From a total_rss of 1 GB, memory usage climbs to the 10 GB limit within ~10 seconds.
Application memory leaks are not a factor; we have checked the heap with jmap, just to make sure.
We have no issues on Temurin 21.0.1 or 21.0.2; the problem starts with 21.0.3 and is also present on the latest 21.0.5.
We have isolated the memory usage to the C2 CompilerThread, based on oom-killer syslog output and on Native Memory Tracking (NMT).
On a lightly loaded production node we hit this after 1-2 hours, so we can reproduce the issue quite easily.
We are using Tomcat images from https://hub.docker.com/_/tomcat.
Ideally we would have some code or an easy way to reproduce this outside of our application, but our application is quite large and we don't know exactly what triggers the issue.
I realize this will be hard to track down based on this bug report, but I will post it anyway, in case others experience similar problems and/or there are additional logs or debug output we could provide.
Did you test with the latest update version?
- Yes
Please provide steps to reproduce where possible
- Start Java Tomcat container
- Run our closed-source application (this might be a problem, I guess)
Expected Results
Memory usage similar to Temurin 21.0.1 - 21.0.2
Actual Results
Consumes all available memory up to docker container limit.
What Java Version are you using?
openjdk 21.0.5 2024-10-15 LTS
OpenJDK Runtime Environment Temurin-21.0.5+11 (build 21.0.5+11-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.5+11 (build 21.0.5+11-LTS, mixed mode, sharing)
What is your operating system and platform?
Ubuntu 20.04 Docker host.
Container image based on Ubuntu 22.04. Docker Hub: tomcat:9.0.90-jdk21-temurin-jammy
How did you install Java?
Docker image from https://hub.docker.com/_/tomcat, which uses eclipse-temurin:21-jdk-jammy
Did it work before?
Works on the following versions:
openjdk 21.0.1 2023-10-17 LTS
OpenJDK Runtime Environment Temurin-21.0.1+12 (build 21.0.1+12-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.1+12 (build 21.0.1+12-LTS, mixed mode, sharing)
openjdk 21.0.2 2024-01-16 LTS
OpenJDK Runtime Environment Temurin-21.0.2+13 (build 21.0.2+13-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.2+13 (build 21.0.2+13-LTS, mixed mode, sharing)
Did you test with other Java versions?
Also failing on:
openjdk 21.0.3 2024-04-16 LTS
OpenJDK Runtime Environment Temurin-21.0.3+9 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (build 21.0.3+9-LTS, mixed mode, sharing)
Relevant log output
OOM killer (syslog)
Oct 27 02:20:14 <hostname> kernel: [12313888.554885] C2 CompilerThre invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Oct 27 02:20:14 <hostname> kernel: [12313888.554911] oom_kill_process.cold+0xb/0x10
Oct 27 02:20:14 <hostname> kernel: [12313888.554975] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Oct 27 02:20:14 <hostname> kernel: [12313888.554988] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=8846fd420713df8fdbf8d257b6e75a27893a1a339e7458d7f7310c50f2d62b6d,mems_allowed=0-1,oom_memcg=/docker/8846fd420713df8fdbf8d257b6e75a27893a1a339e7458d7f7310c50f2d62b6d,task_memcg=/docker/8846fd420713df8fdbf8d257b6e75a27893a1a339e7458d7f7310c50f2d62b6d,task=java,pid=1235475,uid=1337
Oct 27 02:20:14 <hostname> kernel: [12313888.555474] Memory cgroup out of memory: Killed process 1235475 (java) total-vm:15036484kB, anon-rss:10207028kB, file-rss:39704kB, shmem-rss:0kB, UID:1337 pgtables:22964kB oom_score_adj:0
Java Native Memory Tracking output (captured while the problem is happening; Compiler memory usage is growing). Running with Xmx=1G.
docker@<hostname>(devxx):~$ docker exec -i tomcat-$serviceGroup jcmd 1 VM.native_memory
Native Memory Tracking:
(Omitting categories weighting less than 1KB)
Total: reserved=8444154KB, committed=6461730KB
malloc: 5747358KB #685549
mmap: reserved=2696796KB, committed=714372KB
- Java Heap (reserved=1024000KB, committed=464896KB)
(mmap: reserved=1024000KB, committed=464896KB)
- Class (reserved=1052409KB, committed=17977KB)
(classes #20859)
( instance classes #19720, array classes #1139)
(malloc=3833KB #78983) (peak=3849KB #79169)
(mmap: reserved=1048576KB, committed=14144KB)
( Metadata: )
( reserved=131072KB, committed=102144KB)
( used=100769KB)
( waste=1375KB =1,35%)
( Class space:)
( reserved=1048576KB, committed=14144KB)
( used=12346KB)
( waste=1798KB =12,71%)
- Thread (reserved=176700KB, committed=21028KB)
(thread #172)
(stack: reserved=176128KB, committed=20456KB)
(malloc=372KB #1049) (peak=494KB #1365)
(arena=200KB #342) (peak=11107KB #356)
- Code (reserved=253719KB, committed=63699KB)
(malloc=6031KB #15848) (peak=6031KB #15809)
(mmap: reserved=247688KB, committed=57668KB)
- GC (reserved=61148KB, committed=50292KB)
(malloc=8244KB #26589) (peak=8282KB #26796)
(mmap: reserved=52904KB, committed=42048KB)
- GCCardSet (reserved=733KB, committed=733KB)
(malloc=733KB #2566) (peak=740KB #2566)
- Compiler (reserved=5675077KB, committed=5675077KB)
(malloc=1348KB #1300) (peak=1682KB #1416)
(arena=5673729KB #9) (at peak)
- Internal (reserved=1799KB, committed=1799KB)
(malloc=1763KB #51500) (peak=1863KB #53933)
(mmap: reserved=36KB, committed=36KB)
- Other (reserved=2742KB, committed=2742KB)
(malloc=2742KB #213) (peak=2764KB #255)
- Symbol (reserved=30217KB, committed=30217KB)
(malloc=25127KB #295303) (peak=25128KB #295307)
(arena=5090KB #1) (at peak)
- Native Memory Tracking (reserved=10905KB, committed=10905KB)
(malloc=194KB #3433) (peak=203KB #3579)
(tracking overhead=10712KB)
- Shared class space (reserved=16384KB, committed=12972KB, readonly=0KB)
(mmap: reserved=16384KB, committed=12972KB)
- Arena Chunk (reserved=2780KB, committed=2780KB)
(malloc=2780KB #177676) (peak=5681830KB #177677)
- Tracing (reserved=32KB, committed=32KB)
(arena=32KB #1) (at peak)
- Module (reserved=290KB, committed=290KB)
(malloc=290KB #4410) (peak=297KB #4412)
- Safepoint (reserved=8KB, committed=8KB)
(mmap: reserved=8KB, committed=8KB)
- Synchronization (reserved=2454KB, committed=2454KB)
(malloc=2454KB #24125) (peak=2459KB #24175)
- Serviceability (reserved=18KB, committed=18KB)
(malloc=18KB #37) (peak=66KB #369)
- Metaspace (reserved=132724KB, committed=103796KB)
(malloc=1652KB #2421) (at peak)
(mmap: reserved=131072KB, committed=102144KB)
- String Deduplication (reserved=1KB, committed=1KB)
(malloc=1KB #8) (at peak)
- Object Monitors (reserved=14KB, committed=14KB)
(malloc=14KB #71) (peak=129KB #637)
Manual logging of total_rss once per second (from the cgroup v1 memory.stat)
Test container with Xmx=1G, container limit 2.5 GB, oom-killer disabled.
Tue Oct 29 14:00:00 CET 2024: Total RSS: 933.01 MB
Tue Oct 29 14:00:01 CET 2024: Total RSS: 1071.18 MB
------ Problem starts here ------
Tue Oct 29 14:00:02 CET 2024: Total RSS: 1457.25 MB
Tue Oct 29 14:00:03 CET 2024: Total RSS: 1862.92 MB
Tue Oct 29 14:00:04 CET 2024: Total RSS: 2231.92 MB
Tue Oct 29 14:00:05 CET 2024: Total RSS: 2451.58 MB
------ Reached container mem limit 2.5 GB ------
The JVM freezes, runs out of memory, and cannot create new threads:
java.io.IOException: Cannot allocate memory
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
Tue Oct 29 14:00:06 CET 2024: Total RSS: 2451.58 MB
Tue Oct 29 14:00:08 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:09 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:10 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:11 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:12 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:13 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:14 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:15 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:16 CET 2024: Total RSS: 2456.09 MB
Tue Oct 29 14:00:17 CET 2024: Total RSS: 2456.09 MB
@shalseth What is your app doing roughly? Is it a pure Java app or does it have Scala, other JVM languages or any native components?
@karianna, thanks for your reply. No other native components, just pure Java.
It's a large web application with many functions, running in Tomcat.
In this particular case however, it's purely MySQL database transactions using JDBC. Fetching data in a ResultSet, and serializing to JSON using gson.
What strikes me as odd is that we never see anything like this on Temurin 21.0.1/21.0.2 after countless hours in production across several instances, yet we can reproduce it within a few minutes on 21.0.3-21.0.5. Currently we trigger it just by clicking around randomly in our application, but perhaps we can isolate it further at some point.
I did manage to get some more logs today when reproducing the issue, using the following flags (how they were passed to the JVM is sketched after the list):
-Xlog:jit+compilation=debug
-XX:+UnlockDiagnosticVMOptions
-XX:LogFile=/var/log/tomcat9/hotspot_pid1.log
-XX:+LogCompilation
-XX:+PrintCompilation
-XX:+PrintCompilation2
I found a compilation task that takes 9995 ms and eats memory at the time I observe the memory spikes from outside the Docker container. In this case it consumed about 4 GB of additional RSS, with Xmx=1G.
-XX:+PrintCompilation / -XX:+PrintCompilation2
52631 14809 % ! 4 <package.name.redacted>.controller.ActivityController::getActivities @ 3075 (4542 bytes)
62627 14809 % size: 187248(114816) time: **9995** inlined: 5335 bytes
hotspot_pid1.log (-XX:+LogCompilation)
<nmethod compile_id='14809' compile_kind='osr' compiler='c2' level='4' entry='0x00007f08ee6adca0' size='203192' address='0x00007f08ee6aa610' relocation_offset='344' insts_offset='13968' stub_offset='128784' scopes_data_offset='134016' scopes_pcs_offset='184104' dependencies_offset='193432' handler_table_offset='193832' nul_chk_table_offset='202736' oops_offset='132440' metadata_offset='132592' method='no.<package.name.redacted>.controller.ActivityController getActivities (Lxx/xx/portal/core/model/CorporationGroup;Ljava/lang/String;Lxx/xx/portal/xxxxx/model/ActivityFilter;Lxx/xx/portal/core/model/database/OrderColumn;Lxx/xxx/portal/core/model/database/QueryLimit;Lxx/xxx/portal/core/database/Transaction;)Ljava/util/List;' bytes='4542' count='194' backedge_count='114869' iicount='194' stamp='62,606'/>
62627 14809 % size: 187248(114816) time: 9995 inlined: 5335 bytes
See the attached debug log from the compiler thread as well.
Thanks for the extra details - reported upstream at https://bugs.openjdk.org/browse/JDK-8343322
@karianna, following up with additional information based on the upstream questions. I don't have an account at bugs.openjdk.org, so I would appreciate it if you could forward this information.
Question:
Could the reporter try to narrow it down to a specific change by building the VM himself?
Answer:
I've built OpenJDK from source and isolated the commit that introduces this behaviour:
See: https://bugs.openjdk.org/browse/JDK-8309203
I've also built a version of OpenJDK at the commit just before the one above:
commit 8fbd10bc5db476abf76a55a456f811132041301f
Author: Sergey Bylokhov <serb@openjdk.org>, 2024-01-03 22:00:07
This version does not display any of the mentioned symptoms.
I will gladly assist with reproducing the issue, and trying out any patches / changes.
We see the same behaviour on brief tests on current versions of JDK 22 and 23 as well.
Builds / commits:
No issues: 24b3b2b66f0bed3e9812999a5b6d511f31e9ad1f (2023-12-31)
No issues: 8fbd10bc5db476abf76a55a456f811132041301f (2024-01-03)
Issue observed: 966071a5f10a7a8910b2639e10c961924a07d675 (2024-01-04) <--- Issue first observed at this commit
Issue observed: b4c163d07b2db96ce91a0f712ced60ace14dbf63 (2024-01-17)
Question:
And is this really a memory leak? From the report it sounds like it's rather high memory consumption by a single compilation and the memory is freed right after.
Answer:
In this instance the memory is not released for almost two minutes, so it at least looks like a temporary(?) leak ;)
It's worth noting that although the pattern we see is the same in test and production, we've seen much higher memory usage in the few instances where we've run this in production.
We don't know how far it would have gone, but we've hit the container limit after consuming about 9 GB. We haven't seen numbers that high in the test environment yet.
Logs:
The abrupt rise in memory consumption is accompanied by a CPU spike, which subsides after a few seconds, at the same time as the memory usage stops rising.
Log output from a test run on a version of OpenJDK 21.0.3+9 that we compiled today:
Sun Nov 3 15:46:06 CET 2024: Total RSS: 503.45 MB
Sun Nov 3 15:46:08 CET 2024: Total RSS: 523.17 MB
Sun Nov 3 15:46:09 CET 2024: Total RSS: 803.67 MB
Sun Nov 3 15:46:10 CET 2024: Total RSS: 1371.32 MB
Sun Nov 3 15:46:11 CET 2024: Total RSS: 1945.73 MB
Sun Nov 3 15:46:12 CET 2024: Total RSS: 2519.87 MB
Sun Nov 3 15:46:13 CET 2024: Total RSS: 3088.35 MB
Sun Nov 3 15:46:14 CET 2024: Total RSS: 3653.48 MB
Sun Nov 3 15:46:15 CET 2024: Total RSS: 4113.67 MB
Sun Nov 3 15:46:16 CET 2024: Total RSS: 4133.04 MB
Sun Nov 3 15:46:17 CET 2024: Total RSS: 4128.20 MB
Sun Nov 3 15:46:18 CET 2024: Total RSS: 4107.51 MB
Sun Nov 3 15:46:19 CET 2024: Total RSS: 4060.11 MB
Sun Nov 3 15:46:20 CET 2024: Total RSS: 4035.06 MB
Sun Nov 3 15:46:21 CET 2024: Total RSS: 4035.06 MB
Sun Nov 3 15:46:22 CET 2024: Total RSS: 2815.38 MB
Sun Nov 3 15:46:23 CET 2024: Total RSS: 2815.51 MB
Sun Nov 3 15:46:24 CET 2024: Total RSS: 2815.51 MB
Sun Nov 3 15:46:25 CET 2024: Total RSS: 2815.23 MB
Sun Nov 3 15:46:26 CET 2024: Total RSS: 2814.51 MB
Sun Nov 3 15:46:27 CET 2024: Total RSS: 2814.51 MB
Sun Nov 3 15:46:28 CET 2024: Total RSS: 2814.51 MB
Sun Nov 3 15:46:29 CET 2024: Total RSS: 2814.51 MB
Sun Nov 3 15:46:30 CET 2024: Total RSS: 2793.37 MB
Sun Nov 3 15:46:31 CET 2024: Total RSS: 2793.37 MB
Sun Nov 3 15:46:32 CET 2024: Total RSS: 2793.37 MB
Sun Nov 3 15:46:33 CET 2024: Total RSS: 2793.21 MB
Sun Nov 3 15:46:34 CET 2024: Total RSS: 2793.21 MB
Sun Nov 3 15:46:35 CET 2024: Total RSS: 2793.21 MB
Sun Nov 3 15:46:36 CET 2024: Total RSS: 2793.21 MB
Sun Nov 3 15:46:37 CET 2024: Total RSS: 2793.21 MB
Sun Nov 3 15:46:38 CET 2024: Total RSS: 2793.33 MB
Sun Nov 3 15:46:39 CET 2024: Total RSS: 2793.33 MB
Sun Nov 3 15:46:40 CET 2024: Total RSS: 2793.33 MB
Sun Nov 3 15:46:41 CET 2024: Total RSS: 2794.50 MB
Sun Nov 3 15:46:42 CET 2024: Total RSS: 2794.50 MB
....
Sun Nov 3 15:48:00 CET 2024: Total RSS: 520.06 MB
Thanks a lot for narrowing it down @shalseth. I'll put that information into JDK-8343322.
Is this still reproducible with JDK-8340824? @shalseth maybe you can take the patch and apply it to see if it helps:
openjdk/jdk@90c944f
@TobiHartmann, JDK-8340824 looks really promising so far!
I tried applying the patch to jdk-21.0.3-ga, which is quite old relative to the patch, but it applied cleanly and I see an immediate improvement.
I then ran the same test with the patch on jdk-21.0.6+1 (commit: 7dc0f7a64224d37f639ab8e8da2c1aa3295cc92e).
In both cases I notice a slightly larger memory footprint with the patch applied, compared to 21.0.2, but it's negligible, to the point where I wouldn't consider it a problem, or even have noticed it in the first place.
With the patch applied, the behaviour is much more in line with 21.0.1/21.0.2, and it saves around 5-6 GB in peak memory consumption.
During my test it now goes from 500 MB -> 2000 MB -> 500 MB over ~10 seconds, which is completely fine.
With 21.0.2, for comparison, it would be around 500 MB -> 1600 MB -> 500 MB, but again, this is not a big deal and won't cause any problems.
Note that this is an early report, so I will do some more testing, perhaps even with a production workload.
Unless new findings contradict my current results with the patch, I would consider this fixed, assuming the patch finds its way into the next 21.0.x release.
That's great news, thanks for verifying @shalseth. I'll keep JDK-8343322 open for now until final feedback from you.
assuming this patch finds its way into next 21.0.X release
That's up to the OpenJDK maintainers to decide; I'm representing the Oracle JDK here, and we have already backported the fix to Oracle JDK 21.0.7.
I've been running jdk-21.0.6+1 (commit: 7dc0f7a64224d37f639ab8e8da2c1aa3295cc92e) with the patch from JDK-8340824 in production on a small subset of containers since yesterday.
We've had no issues so far. The original issue would present itself within at most 1-2 hours, so I'm closing the issue.
I'm planning to stick with Temurin 21.0.2 in production for now, and to upgrade once a Temurin 21.0.x build with the patch applied becomes available. I will do the same testing then.
Thanks a lot for your help and assistance :)
Great, thanks again!