GoogleCloudPlatform/container-engine-accelerators

nvidia-gpu-device-plugin gets OOM killed

omesser opened this issue · 12 comments

Hey folks,
In our experiments running GPU workloads on GKE over at iguazio, we've hit an OOM kill of the NVIDIA GPU device plugin pod during a GPU load test:

$ kubectl -n kube-system describe pod nvidia-gpu-device-plugin-ngkrv | grep OOM -A15
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 27 Jul 2021 11:15:02 +0300
      Finished:     Tue, 27 Jul 2021 16:49:32 +0300
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     50m
      memory:  20Mi
    Requests:
      cpu:     50m
      memory:  20Mi
    Environment:
      LD_LIBRARY_PATH:  /usr/local/nvidia/lib64
    Mounts:
      /dev from dev (rw)

This happened on Ubuntu nodes with GPUs (n1-standard-16, though I don't think it matters 😄), running GKE 1.19.9-gke.1900.

I suspect the resources allocated to it (memory limit) might not be enough. Maybe up it to 40Mi?
We're using the documented way of installing it, as described in https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu
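(For context, that install is basically just applying the driver installer DaemonSet from this repo; the path below is from memory, so double-check it against the docs:)

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml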

@omesser, Thanks for reporting this issue. On GKE 1.19, you can increase the GPU device plugin memory limit with kubectl -n kube-system edit ds nvidia-gpu-device-plugin and update the memory request and limit. From GKE 1.20 onwards, we've already increased the memory limit to avoid OOMs.
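If you prefer a non-interactive equivalent of the edit, a JSON patch along these lines should also work (the container index 0 and the 50Mi value here are just examples):

kubectl -n kube-system patch ds nvidia-gpu-device-plugin --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "50Mi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "50Mi"}
]'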

Hi @pradvenkat,
Thanks for the reply
This is exactly what we're doing (increasing the limit manually).
Any reason not to change the manifest for GKE 1.19.x so... you know, users wouldn't have to do that? Sounds like an easy and safe fix to get this out of the box.

Encountered the same issue on 1.19.x.
Upgraded the cluster to 1.20.8-gke.900.
After installation, the memory limit of the nvidia-gpu-device-plugin is still 20Mi.

This is also what I see in the source.

Here is the graph of memory utilization over time after I set the limit to 200 MB.
I believe the Y axis is the fraction of the limit, which means the container needs at least 50 MB, plus some headroom...
[screenshot: memory utilization graph]

Out of memory on the driver can, if the timing is right, cause running pods to fail with an OutOfnvidia.com/gpu status,
since the node's GPU "disappears" as far as the Kubernetes scheduler is concerned.

I am raising the memory limit manually for now, but this should be tested and fixed.

"On GKE 1.19, you can increase the GPU device plugin memory limit with kubectl -n kube-system edit ds nvidia-gpu-device-plugin and update the memory request and limit. From GKE 1.20 onwards, we've already increased the memory limit to avoid OOMs."

I noticed that when editing the daemonset manually, the container's limits.memory gets automatically overwritten back to the original value (20Mi) after a few seconds.
From looking at the daemonset's audit log, it seems that system:addon-manager is doing some automatic reconciliation in the background.

If I delete the daemonset --> it also gets automatically recreated.
Applying a fresh YAML --> gets overwritten as well.
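For what it's worth, this is how I'm checking the current value after each attempt (assuming the device plugin is the first container in the pod spec):

kubectl -n kube-system get ds nvidia-gpu-device-plugin -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'

It's back to 20Mi within seconds of any change.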

What's the right way of changing this value so that it doesn't get overwritten?

Hi @assapin
This is managed by the addon manager, and, to my understanding, the mode labels there are descriptive (see: https://discuss.istio.io/t/editing-istio-as-a-gke-add-on/963/2), so if yours is in Reconcile mode, you're out of luck :\

We're installing the GPU device plugin via the installer, which might be different from how you install it.

And when running

kubectl -n kube-system get ds nvidia-gpu-device-plugin -o yaml

What I'm seeing on our gke clusters is

  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    k8s-app: nvidia-gpu-device-plugin

So, in my clusters, changes to the daemonset persist, and the addon manager isn't ruining my day by overriding the limit values.
If you see addonmanager.kubernetes.io/mode: Reconcile instead, that explains the behavior you're seeing, I guess.

From what I gather, there's nothing you can do except change the way you install the GPU plugin so you end up with addonmanager.kubernetes.io/mode: EnsureExists like we do, or just yell at GKE support really loudly :(

@assapin Sorry for the inconvenience this brings. The memory limit fix (we changed the memory limit from 20 MiB to 50 MiB) was initially pushed to 1.21 (all 1.21 versions currently in the Rapid channel contain this fix) and then backported to 1.20 about half a month ago. However, checking our recent releases, this backported version is not available in 1.20 yet (the closest 1.20 release that will include the fix should be 1.20.9-2100, and it will be available next week).

Starting from 1.20, the addonmanager mode changed from EnsureExists to Reconcile (this change is necessary so the device plugin can be automatically updated to use new features together with cluster upgrades); the side effect is that manual edits to the device plugin YAML will be reverted.

To quickly get a higher memory limit, I would suggest upgrading your cluster again to 1.21; all 1.21 versions should contain this fix. Again, sorry for the inconvenience this brings. Please let us know if 50 MiB is not enough for your case (from the memory utilization you included, 50 MiB may not be sufficient either...).
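For reference, the control plane upgrade would look roughly like the command below (the cluster name, zone and exact 1.21 version are placeholders, and node pools need a separate upgrade afterwards):

gcloud container clusters upgrade my-cluster --zone us-central1-a --master --cluster-version 1.21.4-gke.301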

@assapin - I can indeed verify that the addon manager mode here is version dependent (so, IMO, my initial assumption that it's somehow related to how I installed it was wrong).
These are the versions I tested today, with the addon manager modes (and memory limits) on the nvidia-gpu-device-plugin:

v1.17.17-gke.9100 - Reconcile (mem limits 10M)
v1.19.13-gke.1200 - EnsureExists (mem limits 20M)
v1.20.9-gke.1000 - Reconcile (mem limits 20M)
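
For reference, this is roughly how I checked each cluster - the first command shows the addonmanager.kubernetes.io/mode label, the second the current memory limit (assuming the plugin is the first container):

kubectl -n kube-system get ds nvidia-gpu-device-plugin --show-labels
kubectl -n kube-system get ds nvidia-gpu-device-plugin -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'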

@grac3gao
Thanks for getting back to us,

"Starting from 1.20, the addonmanager mode changed from EnsureExists to Reconcile (this change is necessary so the device plugin can be automatically updated to use new features together with cluster upgrades); the side effect is that manual edits to the device plugin YAML will be reverted. ... I would suggest upgrading your cluster again to 1.21"

Not everyone can or wants to upgrade their k8s version to 1.21 (or at all) willy-nilly, and things like this are true blockers for stable/enterprise environments (even in this specific scenario, as you said, 50 MiB might not be enough, and there are always other problematic parameters people would need to tweak).
It makes sense that people want some degree of control over such things.
Can you please provide a manual procedure to change the addon manager's source YAMLs directly, or some way to switch the addon manager mode to EnsureExists so that edits stick, so users can work around this and similar issues on their clusters? It's needed for the nvidia-gpu-device-plugin here, and from what I see around the web, it is also sorely needed for other addons (like Istio).
Thanks!

Yes, I saw this discussion on Istio and tracked down the issue to the "Reconcile" label, which causes the addon manager to re-apply the configuration it has stored locally on the master(?).

@omesser great that you've managed to find a specific version that comes with EnsureExists.
It seems to me the memory configuration of this plugin changes between versions and is hard to get right.
So yes, like you, I would prefer a version that gives me some degree of control over this memory setting,
like the 1.19.13-gke.1200 you mentioned.

@grac3gao
I am all for automagical cluster management features - in AWS I have to install everything manually -
but when they fail and I can't do anything to fix them, it's seriously frustrating...

@grac3gao - ping. Any followup here?

Sorry for the late response. For the nvidia-device-plugin OOM issue, we have the following mitigations and plans:

  1. For clusters with a k8s version of 1.19 or earlier, users can directly edit the memory limit of the nvidia-gpu-device-plugin daemonset.
  2. For clusters with a k8s version later than 1.19, users cannot directly edit the memory limit because of the addon manager mode change. To solve the OOM issue, we have increased the memory limit from 20 MiB to 50 MiB. Currently, any version later than 1.20.9-gke.1100 contains this fix (available in the Regular channel with 1.20.10-gke.301 and in the Rapid channel with 1.21.4-gke.301 and 1.21.4-gke.1801).
  3. For users who don't want to upgrade their clusters, or who find 50 MiB still insufficient, the following workaround is a short-term mitigation (a rough sketch of the commands follows this list):
  • Create a copy of the nvidia-gpu-device-plugin DaemonSet in which you change the name of the daemonset, the node selector (to, for example, 'cloud.google.com/gke-accelerator-modified') and the memory limit to something higher.
  • On each affected node, use kubectl label node to change the label from 'cloud.google.com/gke-accelerator' to 'cloud.google.com/gke-accelerator-modified'.
  • Modify any workloads of yours that rely on the 'cloud.google.com/gke-accelerator' label to use the new 'cloud.google.com/gke-accelerator-modified' label instead. (This label indicates which accelerator type you are using; most likely you use it as a node selector to schedule workloads onto a specific type of node.)
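
A rough sketch of these steps (the file name, NODE_NAME and the nvidia-tesla-t4 accelerator value below are placeholders; adjust them to your cluster):

# Start from a copy of the existing DaemonSet; in the copy, change metadata.name,
# the nodeSelector key and the memory request/limit, and remove uid, resourceVersion,
# creationTimestamp and status before applying.
kubectl -n kube-system get ds nvidia-gpu-device-plugin -o yaml > nvidia-gpu-device-plugin-modified.yaml
kubectl -n kube-system apply -f nvidia-gpu-device-plugin-modified.yaml

# Move an affected node to the new label (remove the old key, add the new one).
kubectl label node NODE_NAME cloud.google.com/gke-accelerator-
kubectl label node NODE_NAME cloud.google.com/gke-accelerator-modified=nvidia-tesla-t4

# Finally, point the nodeSelector/affinity of your GPU workloads at the new label.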

Extending the workaround in point 3, we are planning to pre-configure several nvidia-gpu-device-plugin daemonsets with different memory limit levels. In the future, users will be able to change node labels to switch to a nvidia-gpu-device-plugin daemonset with a memory limit that accommodates their workloads.

@grac3gao-zz
Thanks for the response; unfortunately, not good news, I would say.

"Modify any workloads you have, which required the 'cloud.google.com/gke-accelerator' label to work with the new 'cloud.google.com/gke-accelerator-modified' label"

Obviously that is quite a hacky way around this. I would not want to go with a solution that requires no longer treating such nodes as cloud.google.com/gke-accelerator nodes just because of an arbitrarily low memory limit setting.

"we are planning to pre-configure several nvidia-gpu-device-plugin daemonsets with different memory limit level"

I'll be waiting to see what that looks like then, I suppose, though I can't say I understand why you would go with that solution instead of either upping the limit to something safely above real-life usage under high load, or providing a meaningful way to interact with the addon manager to configure things like this.

Anyways,
Cheers and thanks for answering