awslabs/amazon-eks-ami

Significant JVM pod memory use jump after switching to AL2023-based AMIs

priestjim opened this issue · 8 comments

What happened:

After migrating from AL2-based to AL2023-based EKS AMIs, Java-based workloads consume significantly more memory on startup (pods on AL2 nodes hovered at around 40% of their memory limit; on AL2023 nodes they hover around 80% and are frequently killed by the OOM killer without the service even being actively in use). The node and pod configuration are exactly the same; the nodes were simply rolled by Karpenter with only the AMI name/family changed. This has been repeated a couple of times to validate the occurrence (rolling back and forth between AL2 and AL2023 nodes, letting the pods be rescheduled, and inspecting their memory use).

# cat /proc/(pid)/status on AL2023
[...]
VmPeak:	10511056 kB
VmSize:	10511056 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  788224 kB
VmRSS:	  788224 kB
RssAnon:	  754280 kB
RssFile:	   33944 kB
RssShmem:	       0 kB
VmData:	 1040008 kB
VmStk:	     192 kB
VmExe:	       4 kB
VmLib:	   24840 kB
VmPTE:	    2672 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
CoreDumping:	0
THP_enabled:	1
Threads:	121
[...]
# cat /proc/(pid)/status on AL2
[...]
VmPeak:	 5172272 kB
VmSize:	 5172272 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	  486220 kB
VmRSS:	  486120 kB
RssAnon:	  452508 kB
RssFile:	   33612 kB
RssShmem:	       0 kB
VmData:	  635676 kB
VmStk:	     192 kB
VmExe:	       4 kB
VmLib:	   24840 kB
VmPTE:	    1528 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
CoreDumping:	0
THP_enabled:	1
Threads:	43
[...]

What you expected to happen:

Nominal memory use of a pod provisioned on either AL2 or AL2023-based EKS nodes should remain within the same boundaries.

How to reproduce it (as minimally and precisely as possible):

Launch a simple OpenJDK 11-based hello-world application that allocates a significant amount of memory, with -Xms/-Xmx and pod requests/limits set appropriately, on both AL2 and AL2023 nodes, and compare the pods' memory use.
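A minimal sketch of such a reproducer (the class name, allocation size, and chunking below are illustrative, not taken from the actual workload) could look like this, run with matching -Xms/-Xmx and pod requests/limits on both AMI families:

import java.util.ArrayList;
import java.util.List;

// Minimal reproducer: allocate a significant amount of heap, then idle so the
// pod's steady-state memory use can be compared across the two AMI families.
public class MemoryHello {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("Hello from Java " + System.getProperty("java.version"));
        List<byte[]> blocks = new ArrayList<>();
        // Allocate roughly 1 GiB in 1 MiB chunks (adjust to stay within -Xmx).
        for (int i = 0; i < 1024; i++) {
            blocks.add(new byte[1024 * 1024]);
        }
        System.out.println("Allocated " + blocks.size() + " MiB, sleeping...");
        Thread.sleep(Long.MAX_VALUE); // keep the pod alive for inspection
    }
}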

Anything else we need to know?:

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): Varies (Karpenter-managed)
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.7
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
  • AMI Version: amazon-eks-node-al2023-x86_64-standard-1.29-v20240615
  • Kernel (e.g. uname -a): 6.1.92-99.174.amzn2023.x86_64
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-003ef79521859c6a4"
BUILD_TIME="Sat Jun 15 04:48:19 UTC 2024"
BUILD_KERNEL="6.1.92-99.174.amzn2023.x86_64"
ARCH="x86_64"

I think you're running into the file descriptor limit change discussed in #1746.

On AL2, the limit for your containers is 810042 (both default and max). On AL2023, the default is 65536 and the max is 1048576.

The JVM raises its soft file descriptor limit to the hard maximum (unless configured otherwise with -XX:-MaxFDLimit), and I assume it uses the file descriptor limit to size some region of memory (commonly for things like hash tables). That would explain your higher memory usage on AL2023.

You can try setting the limit to the AL2 values to see if that gets your metrics in line: https://awslabs.github.io/amazon-eks-ami/nodeadm/doc/examples/#modifying-container-rlimits

And if that does the trick, consider disabling the JVM's auto-increase of its file descriptor limit. You should see significantly lower memory allocated with AL2023's default limit of 65536.
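One way to confirm which limit the JVM actually ends up with inside the pod (a diagnostic sketch, assuming a Linux container since it just dumps /proc/self/limits; the class name is arbitrary) is to print the process limits from within the running JVM, once with the default settings and once with -XX:-MaxFDLimit:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Prints the current process's soft/hard limits, including "Max open files",
// as seen from inside the container. Compare the output across AL2/AL2023 and
// with/without -XX:-MaxFDLimit to see what the JVM is actually running with.
public class ShowLimits {
    public static void main(String[] args) throws IOException {
        Files.readAllLines(Paths.get("/proc/self/limits"))
             .forEach(System.out::println);
    }
}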

@cartermckinnon we tried:

  • AL2 file limits (soft/hard at 810042)
  • AL2023 default file limits (65536 / 1048576)
  • Setting -XX:-MaxFDLimit on the workloads

All to no avail: the memory use of the pods remained in the same high range (75%+ of limits). Switching back to AL2 images brought memory use back down.

dims commented

@priestjim i believe Carter asked to try AL2023 with 810042 (both default and max), i.e., use the same limits in AL2023 that were the defaults in your AL2 kernel settings.

@dims yes, we validated the metrics against AL2023 images using the AL2 limits (810042 for both soft/hard maximums), against the AL2023 defaults, and again against AL2/AL2023 limits with JAVA_TOOL_OPTIONS: -XX:-MaxFDLimit.

Thanks @priestjim, I'll try to repro this and take a closer look.

Which Java 11 version is being used? Versions prior to 11.0.16.0 do not have support for cgroup v2, which is used by AL2023. This leaves the JVM without container awareness, using whatever memory it can.

From the pod, what's the output of java -XshowSettings:system -version?

If this is seen then it supports cgroup v2:

Operating System Metrics:
    Provider: cgroupv2

If instead you see this then it does not have support for cgroup v2:

Operating System Metrics:
    No metrics available for this platform
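(As a rough supplementary check of the node itself, independent of the JVM's container awareness, one can also test whether the unified cgroup v2 hierarchy is mounted; this sketch assumes a Linux container and only looks for /sys/fs/cgroup/cgroup.controllers, which exists under cgroup v2 but not under cgroup v1:)

import java.nio.file.Files;
import java.nio.file.Paths;

// Rough check of the cgroup mode visible inside the container:
// cgroup v2 (unified hierarchy) exposes /sys/fs/cgroup/cgroup.controllers;
// cgroup v1 does not.
public class CgroupCheck {
    public static void main(String[] args) {
        boolean v2 = Files.exists(Paths.get("/sys/fs/cgroup/cgroup.controllers"));
        System.out.println(v2 ? "cgroup v2 (unified hierarchy)"
                              : "cgroup v1 (legacy hierarchy)");
    }
}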

@JoeNorth you're absolutely correct: the JDK version we're using (11.0.6) does not support cgroup v2. We'll look into upgrading to the latest OpenJDK 11 patch version and report back.

ah thanks @JoeNorth, I missed the Java version in the description. In general you'll need OpenJDK 11.0.16+ or 15+ to get cgroup v2 support, unless the distro you use has cherry-picked it.