AL2023: OOMKills caused by very high file descriptor limits
mattbrandman opened this issue · 12 comments
What happened:
Certain applications experience immediate OOM kills on AL2023, which I suspect is due to a change in the NOFILE ulimit parameters on the AMI compared to AL2.
What you expected to happen:
The apps to launch without issue
How to reproduce it (as minimally and precisely as possible):
Launch https://github.com/DandyDeveloper/charts/blob/master/charts/redis-ha on an AL2023 node. It immediately goes into CrashLoopBackOff.
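For a more concrete repro, something like the following (a sketch; assumes Helm is installed and the chart is installed from a local clone of that repo):

# Sketch: install the redis-ha chart from a local clone of the repo linked above
# onto a cluster whose nodes run the AL2023 AMI.
git clone https://github.com/DandyDeveloper/charts.git
helm install redis-ha ./charts/charts/redis-ha

# On AL2023 nodes the redis pods go into CrashLoopBackOff almost immediately.
kubectl get pods -w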
Anything else we need to know?:
AL2 does not experience this issue
Environment:
- AWS Region: us-east-1
- Instance Type(s): any instance on AL2023
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.3
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): 1.29
- AMI Version: AL2023
- Kernel (e.g. `uname -a`):
- Release information (run `cat /etc/eks/release` on a node):
Seeing the same here, specifically with Elixir applications immediately exiting with 137 (OOM). I initially thought it was related to moving to cgroupv2. Confirmed that reverting to AL2 has no issues, though.
Can you give us an example Pod with an Elixir program to reproduce this?
Hey @cartermckinnon, sure thing. This is just the latest Elixir image from https://hub.docker.com/_/elixir/, which tries to run a basic REPL but immediately crashes with an OOM error.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: hello-elixir
  name: hello-elixir
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: hello-elixir
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hello-elixir
    spec:
      automountServiceAccountToken: false
      containers:
        - image: elixir:1.16.2-slim
          imagePullPolicy: Always
          name: hello-elixir
          resources:
            limits:
              memory: 1G
            requests:
              memory: 1G
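A quick way to apply this and confirm the kill reason (a sketch; assumes the manifest above is saved as hello-elixir.yaml, and the exact pod name will differ):

# Apply the Deployment above and watch the pod crash loop.
kubectl apply -f hello-elixir.yaml
kubectl get pods -l app.kubernetes.io/name=hello-elixir -w

# The last terminated state should report OOMKilled with exit code 137.
kubectl describe pod -l app.kubernetes.io/name=hello-elixir | grep -A3 "Last State"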
Thanks @evandam! I'll repro this + the redis example mentioned above
Thanks @cartermckinnon! We resolved this by switching over to AWS managed Redis for the time being, and no other apps have had the problem. But it would be nice to know it wouldn't be an issue if we switched back.
TL;DR: this is caused by the container's very high file descriptor limit.
If I create the elixir pod that @evandam gave above, I do indeed see the container get OOM killed immediately.
The memory limit on the cgroup is correct (1GB):
> cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6ea31ceb_8790_4d07_98e3_2a2f65b24544.slice/memory.max
999997440
So the OOM killer is probably doing the right thing. This program is trying to allocate a lot of memory (more than 1GB) right at startup, which sounds related to the Erlang VM to me.
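If you want to double-check that the kernel (and not kubelet) is doing the killing, the node's kernel log records the memcg OOM kill and the cgroup's usage at the time (a sketch; run on the AL2023 node hosting the pod):

# On the node: memcg OOM kills are logged by the kernel, including the victim
# process and how much memory the cgroup was using when the limit was hit.
journalctl -k --no-pager | grep -iE "oom-kill|killed process"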
I swapped to an Erlang pod and limited the size of BEAM/ERTS's super carrier (https://www.erlang.org/doc/apps/erts/supercarrier) using some of the options documented at https://www.erlang.org/doc/man/erts_alloc.html:
apiVersion: v1
kind: Pod
metadata:
  name: erlang
spec:
  containers:
    - image: erlang:27-slim
      imagePullPolicy: Always
      name: erlang
      command:
        - erl
      args:
        # enable super carrier only
        - "+MMsco"
        - "true"
        # disable sys_alloc carriers
        - "+Musac"
        - "false"
        # super carrier size in MB
        - "+MMscs"
        - "500"
      resources:
        limits:
          memory: 1G
        requests:
          memory: 1G
And we no longer get OOM killed! We now get:
> kubectl logs erlang
ll_alloc: Cannot allocate 2147483711 bytes of memory (of type "port_tab").
Which is probably why we were getting OOM killed -- ERTS is trying to allocate > 2GB for the port table.
If you set the size of the port table to its minimum value by passing `+Q 1024`, things work fine:
> kubectl logs erlang
hello, world
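For the Elixir image from the earlier Deployment, where you don't control the erl command line directly, the same cap can probably be applied through the standard ERL_FLAGS environment variable (a sketch; I haven't verified every release setup picks it up):

# Hypothetical workaround: pass +Q 1024 via ERL_FLAGS so the VM caps its port table.
kubectl set env deployment/hello-elixir ERL_FLAGS="+Q 1024"

# The pod should now stay Running instead of being OOM killed at startup.
kubectl get pods -l app.kubernetes.io/name=hello-elixir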
So looking into the size of the port table, we have our answer (emphasis mine): https://www.erlang.org/doc/man/erl.html
+Q Number
Sets the maximum number of simultaneously existing ports for this system if a Number is passed as value. Valid range for Number is [1024-134217727]
NOTE: The actual maximum chosen may be much larger than the actual Number passed. Currently the runtime system often, but not always, chooses a value that is a power of 2. This might, however, be changed in the future. The actual value chosen can be checked by calling erlang:system_info(port_limit).
The default value used is normally 65536. However, if the runtime system is able to determine maximum amount of file descriptors that it is allowed to open and this value is larger than 65536, the chosen value will increased to a value larger or equal to the maximum amount of file descriptors that can be opened.
ERTS is using the file descriptor limit to determine the number of entries in its port table.
Here's where that happens: https://github.com/erlang/otp/blob/b8d646f77d6f33e6aa06c38cb9da2c9ac2dc9d9b/erts/emulator/beam/io.c#L2989-L2998
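You can see what ERTS actually picked by querying erlang:system_info(port_limit) from a throwaway pod with no memory limit (a sketch; on a node where containers inherit the effectively unlimited file descriptor limit this prints a very large value, and the VM needs a couple of GB of free memory just to boot):

# One-off pod: print the port table size the VM chose at startup, then exit.
kubectl run port-limit-check --rm -it --restart=Never --image=erlang:27-slim --command -- \
  erl -noshell -eval 'io:format("port_limit: ~p~n", [erlang:system_info(port_limit)]), halt().'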
The file descriptor limit of `containerd` propagates down to your containers:
> systemctl cat containerd | grep LimitNOFILE
LimitNOFILE=infinity
On AL2023, `infinity` means 2^63-1:
> sysctl fs.file-max
fs.file-max = 9223372036854775807
That comes from `systemd`, version 240 and above: https://github.com/systemd/systemd/blob/a6ab3053aab515ecae7568e0beefee7dbe6f9100/NEWS#L9001
It will bump the default to the max value: https://github.com/systemd/systemd/blob/a6ab3053aab515ecae7568e0beefee7dbe6f9100/src/core/main.c#L1247-L1249
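You can confirm what containers actually inherit with a throwaway pod (a sketch; any small image with a shell works):

# The "Max open files" soft/hard limits as seen from inside a fresh container on the node.
kubectl run limit-check --rm -it --restart=Never --image=busybox --command -- \
  sh -c 'ulimit -n; grep "open files" /proc/self/limits'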
If I change `containerd`'s `LimitNOFILE` to something more sensible, like `1024:65536`, I can remove `+Q 1024` and things work as expected.
So we'll need to either lower the system-wide `fs.file-max` or set `containerd`'s soft limit on file descriptors to something much lower than `infinity`.
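For anyone who wants to experiment on a node before an AMI change ships, a systemd drop-in is the standard way to override containerd's limit (a sketch, run on the node; the drop-in file name is arbitrary, and already-running pods keep their old limit):

# Override containerd's file descriptor limit, then restart it so that new
# containers inherit the lower limit.
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/containerd.service.d/99-limit-nofile.conf
[Service]
LimitNOFILE=1024:65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart containerd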
Redis is doing something similar: it bumps the file descriptor limit as much as possible (https://github.com/redis/redis/blob/f95031c4733078788063de775c968b6dc85522c0/src/server.c#L2301-L2302) and then uses that value to allocate a bunch of memory.
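Outside Kubernetes you can see Redis reacting to the container's file descriptor limit with Docker's --ulimit flag (a sketch; this demonstrates the maxclients adjustment rather than the exact allocation in the linked code):

# Redis derives maxclients (and its event-loop sizing) from the file descriptor
# limit it is able to obtain; with a tiny nofile limit it shrinks maxclients
# and logs a warning about it.
docker run -d --name redis-small --ulimit nofile=1024:1024 redis:7
sleep 2
docker exec redis-small redis-cli config get maxclients
docker logs redis-small | grep -i maxclients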
@cartermckinnon thank you for this very thorough investigation! When do you think the change will land in an AMI update? I'd be happy to test things out when there's a new version available.
Awesome to hear! I'll follow along on your PR and monitor for when a new AMI is available to test out.
Thanks @cartermckinnon