AL2023: OOMKills caused by very high file descriptor limits
mattbrandman opened this issue · 12 comments
What happened:
Certain applications experience immediate OOM kills on AL2023, which I suspect is due to a change in the NOFILE ulimit parameters on the AMI compared to AL2.
What you expected to happen:
The apps to launch without issue
How to reproduce it (as minimally and precisely as possible):
Launch https://github.com/DandyDeveloper/charts/blob/master/charts/redis-ha on an AL2023 node. It immediately goes into CrashLoopBackOff.
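For a more concrete repro, something like the following (a sketch; assumes Helm is installed and the chart is installed from a local clone of that repo):

# Sketch: install the redis-ha chart from a local clone of the repo linked above
# onto a cluster whose nodes run the AL2023 AMI.
git clone https://github.com/DandyDeveloper/charts.git
helm install redis-ha ./charts/charts/redis-ha

# On AL2023 nodes the redis pods go into CrashLoopBackOff almost immediately.
kubectl get pods -w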
Anything else we need to know?:
AL2 does not experience this issue
Environment:
- AWS Region: us-east-1
- Instance Type(s): any instance on AL2023
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.3
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): 1.29
- AMI Version: AL2023
- Kernel (e.g. `uname -a`):
- Release information (run `cat /etc/eks/release` on a node):
Seeing the same here, specifically with Elixir applications immediately exiting with 137 (OOM). I initially thought it was related to moving to cgroupv2. Confirmed that reverting to AL2 has no issues, though.
Can you give us an example Pod with an Elixir program to reproduce this?
Hey @cartermckinnon, sure thing. This is just the latest Elixir image from https://hub.docker.com/_/elixir/, which tries to run a basic REPL but immediately crashes with an OOM error.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: hello-elixir
  name: hello-elixir
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hello-elixir
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hello-elixir
    spec:
      automountServiceAccountToken: false
      containers:
        - image: elixir:1.16.2-slim
          imagePullPolicy: Always
          name: hello-elixir
          resources:
            limits:
              memory: 1G
            requests:
              memory: 1G
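A quick way to apply this and confirm the kill reason (a sketch; assumes the manifest above is saved as hello-elixir.yaml, and the exact pod name will differ):

# Apply the Deployment above and watch the pod crash loop.
kubectl apply -f hello-elixir.yaml
kubectl get pods -l app.kubernetes.io/name=hello-elixir -w

# The last terminated state should report OOMKilled with exit code 137.
kubectl describe pod -l app.kubernetes.io/name=hello-elixir | grep -A3 "Last State"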
Thanks @evandam! I'll repro this + the redis example mentioned above
Thanks @cartermckinnon! We resolved this by switching over to AWS managed Redis for the time being, and no other apps have had the problem. But it would be nice to know it wouldn't be an issue if we switched back.
TL;DR: this is caused by the container's very high file descriptor limit.
If I create the elixir pod that @evandam gave above, I do indeed see the container get OOM killed immediately.
The memory limit on the cgroup is correct (1GB):
> cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ea31ceb_8790_4d07_98e3_2a2f65b24544.slice/memory.max
999997440
So the OOM killer is probably doing the right thing. This program is trying to allocate a lot of memory (more than 1GB) right at startup, which sounds related to the Erlang VM to me.
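If you want to double-check that the kernel (and not kubelet) is doing the killing, the node's kernel log records the memcg OOM kill and the cgroup's usage at the time (a sketch; run on the AL2023 node hosting the pod):

# On the node: memcg OOM kills are logged by the kernel, including the victim
# process and how much memory the cgroup was using when the limit was hit.
journalctl -k --no-pager | grep -iE "oom-kill|killed process"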
I swapped to an Erlang pod and limited the size of BEAM/ERTS's super carrier (https://www.erlang.org/doc/apps/erts/supercarrier) using some of the options documented at https://www.erlang.org/doc/man/erts_alloc.html:
apiVersion: v1
kind: Pod
metadata:
  name: erlang
spec:
  containers:
    - image: erlang:27-slim
      imagePullPolicy: Always
      name: erlang
      command:
        - erl
      args:
        # enable super carrier only
        - "+MMsco"
        - "true"
        # disable sys_alloc carriers
        - "+Musac"
        - "false"
        # super carrier size in MB
        - "+MMscs"
        - "500"
      resources:
        limits:
          memory: 1G
        requests:
          memory: 1G
And we no longer get OOM killed! We now get:
> kubectl logs erlang
ll_alloc: Cannot allocate 2147483711 bytes of memory (of type "port_tab").
Which is probably why we were getting OOM killed -- ERTS is trying to allocate > 2GB for the port table.
If you set the size of the port table to its minimum value by passing `+Q 1024`, things work fine:
> kubectl logs erlang
hello, world
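For the Elixir image from the earlier Deployment, where you don't control the erl command line directly, the same cap can probably be applied through the standard ERL_FLAGS environment variable (a sketch; I haven't verified every release setup picks it up):

# Hypothetical workaround: pass +Q 1024 via ERL_FLAGS so the VM caps its port table.
kubectl set env deployment/hello-elixir ERL_FLAGS="+Q 1024"

# The pod should now stay Running instead of being OOM killed at startup.
kubectl get pods -l app.kubernetes.io/name=hello-elixir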
So looking into the size of the port table, we have our answer (emphasis mine): https://www.erlang.org/doc/man/erl.html
+Q Number
Sets the maximum number of simultaneously existing ports for this system if a Number is passed as value. Valid range for Number is [1024-134217727]
NOTE: The actual maximum chosen may be much larger than the actual Number passed. Currently the runtime system often, but not always, chooses a value that is a power of 2. This might, however, be changed in the future. The actual value chosen can be checked by calling erlang:system_info(port_limit).
The default value used is normally 65536. However, if the runtime system is able to determine maximum amount of file descriptors that it is allowed to open and this value is larger than 65536, the chosen value will increased to a value larger or equal to the maximum amount of file descriptors that can be opened.
ERTS is using the file descriptor limit to determine the number of entries in its port table.
Here's where that happens: https://github.com/erlang/otp/blob/b8d646f77d6f33e6aa06c38cb9da2c9ac2dc9d9b/erts/emulator/beam/io.c#L2989-L2998
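You can see what ERTS actually picked by querying erlang:system_info(port_limit) from a throwaway pod with no memory limit (a sketch; on a node where containers inherit the effectively unlimited file descriptor limit this prints a very large value, and the VM needs a couple of GB of free memory just to boot):

# One-off pod: print the port table size the VM chose at startup, then exit.
kubectl run port-limit-check --rm -it --restart=Never --image=erlang:27-slim --command -- \
  erl -noshell -eval 'io:format("port_limit: ~p~n", [erlang:system_info(port_limit)]), halt().'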
The file descriptor limit of `containerd` propagates down to your containers:
> systemctl cat containerd | grep LimitNOFILE
LimitNOFILE=infinity
On AL2023, `infinity` means 2^63-1:
> sysctl fs.file-max
fs.file-max = 9223372036854775807
That comes from `systemd`, version 240 and above: https://github.com/systemd/systemd/blob/a6ab3053aab515ecae7568e0beefee7dbe6f9100/NEWS#L9001
It will bump the default to the max value: https://github.com/systemd/systemd/blob/a6ab3053aab515ecae7568e0beefee7dbe6f9100/src/core/main.c#L1247-L1249
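You can confirm what containers actually inherit with a throwaway pod (a sketch; any small image with a shell works):

# The "Max open files" soft/hard limits as seen from inside a fresh container on the node.
kubectl run limit-check --rm -it --restart=Never --image=busybox --command -- \
  sh -c 'ulimit -n; grep "open files" /proc/self/limits'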
If I change `containerd`'s `LimitNOFILE` to something more sensible, like `1024:65536`, I can remove `+Q 1024` and things work as expected.
So we'll need to either lower the system-wide `fs.file-max` or set `containerd`'s soft limit on file descriptors to something much lower than `infinity`.
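For anyone who wants to experiment on a node before an AMI change ships, a systemd drop-in is the standard way to override containerd's limit (a sketch, run on the node; the drop-in file name is arbitrary, and already-running pods keep their old limit):

# Override containerd's file descriptor limit, then restart it so that new
# containers inherit the lower limit.
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/containerd.service.d/99-limit-nofile.conf
[Service]
LimitNOFILE=1024:65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart containerd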
Redis is doing something similar: it bumps the file descriptor limit as much as possible (https://github.com/redis/redis/blob/f95031c4733078788063de775c968b6dc85522c0/src/server.c#L2301-L2302) and then uses that value to allocate a bunch of memory.
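Outside Kubernetes you can see Redis reacting to the container's file descriptor limit with Docker's --ulimit flag (a sketch; this demonstrates the maxclients adjustment rather than the exact allocation in the linked code):

# Redis derives maxclients (and its event-loop sizing) from the file descriptor
# limit it is able to obtain; with a tiny nofile limit it shrinks maxclients
# and logs a warning about it.
docker run -d --name redis-small --ulimit nofile=1024:1024 redis:7
sleep 2
docker exec redis-small redis-cli config get maxclients
docker logs redis-small | grep -i maxclients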
@cartermckinnon thank you for this very thorough investigation! When do you think the change will land in an AMI update? I'd be happy to test things out when there's a new version available.
Awesome to hear! I'll follow along on your PR and monitor for when a new AMI is available to test out.
Thanks @cartermckinnon