kubernetes/kubernetes

Make OOM not be a SIGKILL

grosser opened this issue Β· 65 comments

At the moment, apps that go over the memory limit are hard-killed ('OOMKilled'), which is bad (state is lost, cleanup code never runs, etc.)

Is there a way to get a SIGTERM instead (with a grace period, or 100m before reaching the limit)?

@kubernetes/sig-node-feature-requests

vishh commented

It is not currently possible to change the OOM behavior. Kubernetes (or the runtime) could send your container a signal whenever it is close to its memory limit, but this would be best-effort only, because memory spikes might not be handled in time.

FYI, using this crutch at the moment: https://github.com/grosser/preoomkiller

Any idea what would need to change to make the OOM behavior configurable?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

/remove-lifecycle stale
/cc @dashpole

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Was this meant to be closed?

It seems like @yujuhong meant to say /remove-lifecycle rotten?

/remove-lifecycle rotten

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

When the node is under memory pressure I can understand some SIGKILLs happening, but a pod also gets a SIGKILL when it reaches its manually set resource limit. As the initial post mentions, this can cause a lot of harm.

As a workaround we're going to try to make the pod report unhealthy before it reaches the memory limit so that it gets a graceful shutdown (see the sketch after this comment).
Kubernetes sending such a signal beforehand would solve the issue.

If I want this feature created, how should I go about it? Should I provide a PR with code changes or ping someone to make a proposal?
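A minimal sketch of that workaround, assuming the default cgroup v1 mount inside the container and an illustrative 90% threshold: a health endpoint that starts failing once memory usage approaches the limit, so a liveness probe restarts the pod through the normal termination path (SIGTERM plus terminationGracePeriodSeconds) instead of waiting for the kernel OOM kill. The paths, port and threshold here are examples, not anything Kubernetes provides out of the box.

// Hypothetical sketch of the "report unhealthy before the limit" workaround.
// Assumes the cgroup v1 memory controller is mounted at the usual location.
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
)

// readUint parses a single integer value from a cgroup file.
func readUint(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	const threshold = 0.90 // fail the probe at 90% of the limit (tune per app)

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		usage, err1 := readUint("/sys/fs/cgroup/memory/memory.usage_in_bytes")
		limit, err2 := readUint("/sys/fs/cgroup/memory/memory.limit_in_bytes")
		if err1 != nil || err2 != nil || limit == 0 {
			w.WriteHeader(http.StatusOK) // no cgroup info: don't flap the probe
			return
		}
		ratio := float64(usage) / float64(limit)
		if ratio >= threshold {
			http.Error(w, fmt.Sprintf("memory usage at %.0f%% of limit", ratio*100), http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintf(w, "ok (%.0f%% of limit)\n", ratio*100)
	})
	http.ListenAndServe(":8080", nil)
}

A livenessProbe pointed at /healthz should then restart the container with the usual SIGTERM and grace period before the kernel's OOM killer is ever involved.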

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

gajus commented

This remains an active issue.

There appears to be no way to gracefully handle OOMKilled at the moment.

@dashpole Can this be re-opened?

/reopen
/remove-lifecycle rotten

@dashpole: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gajus commented

I have since addressed this issue by monitoring memory usage within the program itself and gracefully terminating the service when it gets close to the memory limit.

I have open-sourced the abstraction used to retrieve the Kubernetes resources for the currently running pod.

https://github.com/gajus/preoom

What advantage does asking the metrics server have over inspecting local memory state, as preoomkiller does?

gajus commented

For one, as far as I can see, preoomkiller requires you to input the limits, whereas preoom uses the API to pull the data required to construct the check.

preoomkiller does not need the limits as input; by default it reads usage from /sys/fs/cgroup/memory/memory.usage_in_bytes and the limit from /sys/fs/cgroup/memory/memory.stat.

gajus commented

How does it know what is configured in resources.limits.memory?

hierarchical_memory_limit from /sys/fs/cgroup/memory/memory.stat

gajus commented

Understood. I wasn't aware that this information is available in the filesystem.
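For illustration, a short sketch of reading those files (assuming the default cgroup v1 layout): the effective limit comes from the hierarchical_memory_limit line in memory.stat, and current usage from memory.usage_in_bytes.

// Sketch of reading usage and the hierarchical limit from the cgroup v1
// filesystem, as described above. Paths assume the default cgroup v1 layout.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// hierarchicalMemoryLimit extracts hierarchical_memory_limit from memory.stat,
// which reflects the effective limit set via resources.limits.memory.
func hierarchicalMemoryLimit(statPath string) (uint64, error) {
	f, err := os.Open(statPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "hierarchical_memory_limit" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("hierarchical_memory_limit not found in %s", statPath)
}

func main() {
	limit, err := hierarchicalMemoryLimit("/sys/fs/cgroup/memory/memory.stat")
	if err != nil {
		panic(err)
	}
	raw, err := os.ReadFile("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	if err != nil {
		panic(err)
	}
	usage, _ := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
	fmt.Printf("usage: %d / limit: %d (%.1f%%)\n", usage, limit, 100*float64(usage)/float64(limit))
}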


My knowledge of this stuff is pretty limited, but the kernel documentation seems to indicate it's at least possible to get a notification when memory usage in the cgroup crosses a threshold:

  1. Memory thresholds

Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows to register multiple memory and memsw
thresholds and gets notifications when it crosses.

To register a threshold, an application must:

  • create an eventfd using eventfd(2);
  • open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
  • write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
    cgroup.event_control.

Application will be notified through eventfd when memory usage crosses
threshold in any direction.

It's applicable for root and non-root cgroup.

See https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
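A rough sketch of that registration, assuming the container is allowed to open its own memory cgroup's cgroup.event_control for writing (which stock Kubernetes may not permit without a privileged pod or an explicit cgroupfs mount); the 900 MiB threshold is just an example value.

// Minimal sketch of the cgroup v1 memory threshold notification described above.
package main

import (
	"encoding/binary"
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const cg = "/sys/fs/cgroup/memory"
	const threshold = 900 * 1024 * 1024 // notify at 900 MiB (example value)

	// 1. Create an eventfd to receive the notification on.
	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		panic(err)
	}

	// 2. Open memory.usage_in_bytes (the file being watched).
	usage, err := os.Open(cg + "/memory.usage_in_bytes")
	if err != nil {
		panic(err)
	}

	// 3. Register "<event_fd> <usage_fd> <threshold>" with cgroup.event_control.
	ctl, err := os.OpenFile(cg+"/cgroup.event_control", os.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	if _, err := fmt.Fprintf(ctl, "%d %d %d", efd, usage.Fd(), threshold); err != nil {
		panic(err)
	}

	// 4. Block until the kernel signals that usage crossed the threshold.
	buf := make([]byte, 8)
	if _, err := unix.Read(efd, buf); err != nil {
		panic(err)
	}
	fmt.Printf("memory threshold crossed (eventfd counter=%d); start graceful shutdown\n",
		binary.LittleEndian.Uint64(buf))
}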


izakp commented

After following this thread I wrote this tool, which you can use to send a nice SIGTERM to your entrypoint script or PID 1 of the container: https://github.com/izakp/docker-preoomkiller

I'm also of the opinion that it's not Docker's or Kubernetes' ultimate responsibility to nicely terminate containers that reach their memory limit (ideally processes should respect their memory constraints), but this could be useful as a shim to watch and gracefully terminate processes with memory leaks, etc.

I came up with this command as a k8s CronJob to kill one pod exceeding the threshold each time the CronJob runs. This keeps memory creep in check with minimal disruption.

args:
- /bin/sh
- -exc
- pod=$(kubectl top pods -n '{{ .Release.Namespace }}' -l '{{ .Values.memoryReaper.selector }}' | sort -rk 3 | awk '$3+0 >= '{{ .Values.memoryReaper.threshold }}' {print $1}' | head -n1); [[ -n "$pod" ]] && kubectl delete -n {{ .Release.Namespace }} pod $pod || true

Caveat: it does not work when memory is reported in gigabytes instead of megabytes.

Does anyone have a solution? Thanks.

gajus commented

I have since addressed this issue by monitoring memory usage within the program itself and gracefully terminating the service when it gets close to the memory limit.

I have open-sourced the abstraction used to retrieve the Kubernetes resources for the currently running pod.

https://github.com/gajus/preoom

^

izakp commented

@yxxhero we are using https://github.com/artsy/docker-preoomkiller in production to soft-restart some image resizing background task containers that tend to bloat in memory - it's working quite well

Any particular reason why the ability to disable the OOM killer for a container isn't going to happen? The OOM killer is a pretty horrible/stupid way to handle out-of-memory conditions, doubly so inside a Kubernetes container with a limit set on it.

Is there any reason why this requirement can't be met with terminationGracePeriodSeconds (which sends a SIGTERM x seconds before the SIGKILL)?

Is there any reason why this requirement can't be met with terminationGracePeriodSeconds (which sends a SIGTERM x seconds before the SIGKILL)?

The SIGKILL comes directly from the Linux kernel: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt

Doesn't the pod get a SIGTERM first, then after grace period gets a SIGKILL?
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

Or does this grace period not apply to the OOMKiller?

Doesn't the pod get a SIGTERM first, then after grace period gets a SIGKILL?
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

Or does this grace period not apply to the OOMKiller?

The OOM killer is triggered by the kernel's cgroup machinery; the container runtime and Kubernetes only passively observe the result and have no opportunity to act before it happens.

case r.State.OOMKilled:
// TODO: consider exposing OOMKilled via the runtimeAPI.
// Note: if an application handles OOMKilled gracefully, the
// exit code could be zero.
reason = "OOMKilled"

It is the kernel's OOM killer that is killing pods, so maybe the kernel can also be the solution here?

Reading the kernel docs, it looks like the cgroup memory controller (v1) has had support for indicating memory pressure for at least 10 years now:

memory.pressure_level		 # set memory pressure notifications

If a pod is able to mount and write to its own cgroup, it may be the case that it can configure the memory controller for the cgroup to use these pressure notifications. Presumably, each app would need to be tuned to determine at what pressure level it should take some special action, such as:

  • At "low" do nothing
  • At "medium", Triggering garbage collection to free up space
  • At "critical", Either try to reclaim memory, or start a graceful shutdown / stop accepting new work, so that the pod can be restarted without interrupting service.

These levels are described in the kernel documentation:

11. Memory Pressure

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies of managing their memory resources. The pressure
levels are defined as following:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shutdown unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure, the system might be making swap, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing, it is
about to out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.

However, to use this, it looks like you have to be able to write to the cgroup filesystem:

write string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.

And it may be the case (I haven't tried yet) that the container doesn't have permission to write to the cgroup filesystem. If so, then only Kubernetes can configure the containers in a pod to be able to receive memory pressure notifications. I believe there may be some failsafes that prevent mounting the cgroupfs read/write within an unprivileged Kubernetes pod; it wouldn't make much sense to let pods arbitrarily break their resource contracts. I think this particular path is safe though, so ideally Kubernetes should at least not prevent the application from setting up the fd.

While not exactly a pre-OOM handler, I think a memory pressure handler inside the container init, based on cgroup memory pressure, can probably solve this problem. If the container init process is able to listen on this event file descriptor, it can offer a generic "pre-OOM" handler hook, triggered by the same memory controller that sends the OOM signal, as the pod approaches its cgroup memory limit.

Perhaps this can be prototyped with a privileged pod, and an abstraction defined that requires the container init process to add support for monitoring this eventfd in order to use these notifications? If Kubernetes currently blocks this, perhaps a patch can be submitted to "safelist" the path to a pod's memory pressure notifier?
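A hedged sketch of that registration: the eventfd plumbing is the same as in the threshold example earlier in the thread, only the string written to cgroup.event_control differs. It again assumes the cgroup files are writable from inside the container, and the reaction to "critical" is purely illustrative.

// Sketch of registering for cgroup v1 memory pressure notifications as
// described above. Assumes the cgroup files are writable from inside the
// container, which stock Kubernetes may not allow.
package main

import (
	"fmt"
	"os"
	"runtime/debug"

	"golang.org/x/sys/unix"
)

func main() {
	const cg = "/sys/fs/cgroup/memory"

	efd, err := unix.Eventfd(0, 0)
	if err != nil {
		panic(err)
	}
	pressure, err := os.Open(cg + "/memory.pressure_level")
	if err != nil {
		panic(err)
	}
	ctl, err := os.OpenFile(cg+"/cgroup.event_control", os.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	// Register "<event_fd> <fd of memory.pressure_level> <level[,mode]>".
	if _, err := fmt.Fprintf(ctl, "%d %d critical", efd, pressure.Fd()); err != nil {
		panic(err)
	}

	buf := make([]byte, 8)
	for {
		if _, err := unix.Read(efd, buf); err != nil {
			panic(err)
		}
		// Example reaction at "critical": free what we can, then begin a
		// graceful shutdown instead of waiting for the OOM killer.
		debug.FreeOSMemory()
		fmt.Println("critical memory pressure: draining work and shutting down")
		// initiate application-specific shutdown here
	}
}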


What makes this really interesting is that sometimes you want the kernel to do the OOM kill when the node is under memory pressure, while other times it would be nice if Kubernetes would preemptively SIGTERM the container because it is approaching its configured limit.


Even better would be for going over the limit to make the call to brk fail (i.e. malloc returning NULL), allowing the application to gracefully handle running out of memory the normal way instead of having the process killed.

lonre commented

It should provide an opportunity to shut down gracefully.


for reasons outlined in #40157 (comment) we can't just change the signal delivered. OTOH we can integrate with some OOM daemons, but this would require a separate discussion and KEP.

Kubernetes does not use issues on this repo for support requests. If you have a question on how to use Kubernetes or to debug a specific issue, please visit our forums.

/remove-kind feature
/kind support
/close

@fromanirh: Closing this issue.

In response to this:

Kubernetes does not use issues on this repo for support requests. If you have a question on how to use Kubernetes or to debug a specific issue, please visit our forums.

/remove-kind feature
/kind support
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tsuna commented

@fromanirh this is not a support request, it's a legit feature request.

T3rm1 commented

@fromanirh Can you reopen this, please? It's clearly a feature request, not a support case.

aojea commented

for reasons outlined in #40157 (comment) we can't just change the signal delivered. OTOH we can integrate with some OOM daemons, but this would require a separate discussion and KEP.

It is a feature request for the kernel, not for Kubernetes; the kernel generates the SIGKILL.

@aojea there are non-kernel solutions here, such as triggering graceful shutdowns at a threshold memory usage (e.g. 95%) before the hard limit is reached.

aojea commented

@aojea there are non-kernel solutions here, such as triggering graceful shutdowns at a threshold memory usage (e.g. 95%) before the hard limit is reached.

oh, that is not clear from the title and from the comments, sorry

for reasons outlined in #40157 (comment) we can't just change the signal delivered. OTOH we can integrate with some OOM daemons, but this would require a separate discussion and KEP.

So it should be retitled, or a new issue opened with the clear request... and for sure that will need a KEP.

Sure, how about titling it: "Add graceful memory usage-based SIGTERM before hard OOM kill happens"

Is there another Issue that's been recreated for this? I can't find one in the Issues list. If not, I can create a new feature request issue.

So has an issue been created already? If so, it would be nice to reference the link here.

@ffromani this is not a support request; this is a feature request that has been bumped for 6 years now, and I just found a use case for it, just as countless other users have.

Please reopen the issue

Adding my own voice: a service being unable to shut down gracefully can cause all kinds of harm, and it causes harm for my organization as well. I thought Kubernetes already sent a SIGTERM first and was trying to substantiate that. Is this feature request still unimplemented? When the memory limit is reached, are pods SIGKILLed immediately with no chance to shut down gracefully?

/reopen
/kind feature

There's obvious interest in this feature, but it would surely need a KEP and someone shepherding the feature.

@ffromani: Reopened this issue.

In response to this:

/reopen
/kind feature

There's obvious interest in this feature, but it would surely need a KEP and someone shepherding the feature.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/remove-kind support
/triage accepted

FWIW my user story is:

I have a Ruby application (not developed by us) that has a slight memory leak: over many days it balloons to a certain amount of memory and then gradually keeps growing.

For this, I started using oomhero to try to manage it, but it does not seem to be perfect, nor does it always work. I want to set a hard limit on certain containers and have the application receive a SIGTERM just before (or right as) it reaches that limit, so it can shut down normally and clean up.

Sometimes these containers use imagemagick or ffmpeg to process media, which can balloon memory usage. The container could shut itself down and kick the job back to the queue when this happens, in a way that does not leave behind resident keepalive queues (sidekiq) and other things that need to be cleaned out every so often.


Implementation idea: the container that has been sent a SIGTERM for exceeding the memory limit gets put first as a candidate for the OOM killer. This would mean that even if the application ignores the SIGTERM (or needs extra memory to shut down), it can briefly overshoot into the remaining system resources before being killed itself.

This could be a per-container toggle, from a "soft" (SIGTERM-first) OOM kill to the "hard" OOM kill (the default).
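Purely as an illustration of that idea (this is not an existing Kubernetes feature), a sketch of what the in-container half could look like: on SIGTERM the process marks itself as the preferred OOM victim via /proc/self/oom_score_adj, so that any overshoot during cleanup kills it rather than a neighbour.

// Illustrative sketch only: this is not an existing Kubernetes feature.
// On receiving SIGTERM we raise our own oom_score_adj so the kernel picks
// this process first if memory runs out while we are cleaning up.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	<-sigs // e.g. a pre-OOM watchdog or the kubelet asked us to stop

	// 1000 is the maximum: the OOM killer will prefer this process.
	path := fmt.Sprintf("/proc/%d/oom_score_adj", os.Getpid())
	if err := os.WriteFile(path, []byte("1000"), 0644); err != nil {
		fmt.Fprintln(os.Stderr, "could not adjust oom_score_adj:", err)
	}

	// ... run cleanup / drain work here, then exit ...
	fmt.Println("cleanup done, exiting")
}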