aws/amazon-vpc-cni-k8s

v1.16.3 High CPU Usage on Medium Instance Size

aballman opened this issue · 26 comments

What happened:

I updated to v1.16.3. Specifically on medium instance sizes, CPU usage by the aws-node pod is maxed out.

On a t3.medium (I have also seen this happen on a c6g.medium)
Screenshot 2024-02-22 at 1 41 19 PM

On an m7i.2xlarge
Screenshot 2024-02-22 at 1 42 14 PM

Attach logs

This log is spewing constantly in /var/log/aws-routed-eni/ipamd.log on the medium; other instance sizes show a typical volume of logs:

{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:779","msg":"Node found \"ip-10-146-63-142.ec2.internal\" - no of taints - 0"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:658","msg":"Skipping ENI allocation as the max ENI limit is already reached"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:658","msg":"Starting to increase pool size"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:779","msg":"Node found \"ip-10-146-63-142.ec2.internal\" - no of taints - 0"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:658","msg":"Skipping ENI allocation as the max ENI limit is already reached"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:658","msg":"Starting to increase pool size"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:779","msg":"Node found \"ip-10-146-63-142.ec2.internal\" - no of taints - 0"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:658","msg":"Skipping ENI allocation as the max ENI limit is already reached"}
{"level":"debug","ts":"2024-02-22T21:34:38.560Z","caller":"ipamd/ipamd.go:658","msg":"Starting to increase pool size"}

What you expected to happen:
Typical CPU usage for medium instances

How to reproduce it (as minimally and precisely as possible):
Spin up a medium-size instance in an EKS cluster with the v1.16.3 CNI. In my case it's a t3.medium.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.29.0-eks-c417bb3
  • CNI Version: v1.16.3
  • OS (e.g: cat /etc/os-release): Bottlerocket OS 1.19.1 (aws-k8s-1.29)
  • Kernel (e.g. uname -a): 6.1.72

Great debugging, this looks like a bug in the logic for rapid scaling: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L660

The code seeks to assign IPs until the desired state can be met, but if no more ENIs can be attached to the node, this for loop will spin indefinitely...
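The livelock described above can be sketched in a few lines of Go. This is a hypothetical reduction, not the actual ipamd source (the names `nodeState` and `increasePool` are illustrative): once the ENI limit is hit and every IP on the attached ENIs is assigned, nothing in the loop body changes the "too low" condition, so the loop spins at full CPU. A guard that returns when no progress is possible breaks the livelock.

```go
package main

import "fmt"

// nodeState is a toy model of the IP datastore on one node.
type nodeState struct {
	attachedENIs int // ENIs currently attached
	maxENIs      int // instance-type ENI limit
	assignedIPs  int // IPs allocated so far
	targetIPs    int // desired pool size (e.g. max pods + warm target)
	ipsPerENI    int // IPv4 addresses per ENI for this instance type
}

// increasePool mimics the rapid-scaling loop: keep allocating until the
// target is met. Without the final return, a target above the node's
// total IP capacity would make this loop spin forever.
func increasePool(s *nodeState) {
	for s.assignedIPs < s.targetIPs {
		capacity := s.attachedENIs * s.ipsPerENI
		if s.assignedIPs < capacity {
			s.assignedIPs++ // room on an existing ENI
			continue
		}
		if s.attachedENIs < s.maxENIs {
			s.attachedENIs++ // attach another ENI
			continue
		}
		// Max ENIs reached and all IPs assigned: no progress is
		// possible, so bail out instead of busy-looping.
		return
	}
}

func main() {
	// t3.medium-like limits (3 ENIs, 6 IPs each) with a target above
	// the node's capacity, as happens when max pods is set too high.
	s := &nodeState{attachedENIs: 1, maxENIs: 3, ipsPerENI: 6, targetIPs: 35}
	increasePool(s)
	fmt.Println(s.attachedENIs, s.assignedIPs) // 3 18
}
```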

This is an unfortunate miss. While we fix this, I recommend downgrading to v1.16.2. Can you also let us know the following:

  • what warm/minimum target values you have set
  • whether you have any unmanaged ENIs attached
  • whether you have Custom Networking or Security Groups for Pods enabled

Thanks for the quick response!

WARM_ENI_TARGET: 0
WARM_IP_TARGET: 10

No additional ENI attached
No custom networking or security groups for pods
Network Policy Agent is enabled

@aballman I did a bit more digging and have a fix to prevent the livelock, but what is curious is why the datastore is still returning "too low" in your case. When all ENIs are attached and all IPs are assigned to those ENIs (including warm IPs), then max pods should be hit and the datastore should stop trying to allocate IPs. It should actually happen sooner than max, since some pods run in host networking namespace.

For t3.medium, after 18 IPs are allocated (3 ENIs*6 IPs per ENI), the datastore should never return "too low", as max pods should be set to 18. Can you check what --max-pods is set to on the node? One way to do this is to just run systemctl status kubelet -l on the instance and see the argument passed to kubelet.
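For reference, the default max-pods value for this mode of the VPC CNI can be sanity-checked with the formula the EKS-optimized AMI uses: each ENI reserves its primary IP for the node itself, and 2 is added for host-networking pods. The sketch below assumes that formula; for t3.medium (3 ENIs, 6 IPv4 addresses per ENI) it yields 17, which is why a kubelet setting of 35 lets the datastore chase an unreachable target.

```go
package main

import "fmt"

// recommendedMaxPods computes the default EKS max-pods value for the
// VPC CNI in its standard mode: ENIs * (IPs per ENI - 1) + 2, where the
// "-1" accounts for each ENI's primary IP and "+2" covers pods in the
// host networking namespace.
func recommendedMaxPods(enis, ipsPerENI int) int {
	return enis*(ipsPerENI-1) + 2
}

func main() {
	// t3.medium instance limits: 3 ENIs, 6 IPv4 addresses per ENI.
	fmt.Println(recommendedMaxPods(3, 6)) // 17
}
```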

My clusters are setup with karpenter and have a lot of autoscaling happening. I don't have any medium instances at the moment. I'll see if I can get one launched and report back.

The KubeletConfiguration file specifies maxPods: 35 on ami-0f623fc60e44e28fc / v1.29.0-eks-a5ec690 for a t3.medium

The KubeletConfiguration file specifies maxPods: 35 on ami-0f623fc60e44e28fc / v1.29.0-eks-a5ec690 for a t3.medium

Well that is definitely wrong. An instance cannot support more pods than it has IPs available to assign to them. We should track that issue separately, where there are two options:

  1. You file an AWS support case in your console, which will get the EKS nodes team involved
  2. You create an issue at https://github.com/awslabs/amazon-eks-ami/issues, which will also get the EKS nodes team involved, but likely more slowly

I'm using bottlerocket, would that be the same repo to report the issue to?
Thanks for your help here

I'd like to report that we are also seeing this on v1.16.3. Right after an upgrade to EKS 1.26 and AWS VPC CNI v1.16.3, it's a night-and-day difference, with the exact same error. The arrow notes the moment we upgraded.

image

The VPC CNI pods were restarted and the CPU usage seems to have "normalized" by itself; I can't explain what changed aside from the pods no longer being in CrashLoopBackOff:

> kc get pods -l k8s-app=aws-node
NAME                                                      READY   STATUS             RESTARTS        AGE
aws-node-2tj8f                                            2/2     Running            0               169m
aws-node-n42kk                                            2/2     Running            1 (87m ago)     16h
aws-node-v2ggb                                            1/2     CrashLoopBackOff   31 (115s ago)   3h53m

> kc logs aws-node-v2ggb
Defaulted container "aws-node" out of: aws-node, aws-eks-nodeagent, aws-vpc-cni-init (init)
Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-cni
time="2024-02-27T19:40:05Z" level=info msg="Starting IPAM daemon... "
time="2024-02-27T19:40:05Z" level=info msg="Checking for IPAM connectivity... "
time="2024-02-27T19:40:08Z" level=info msg="Copying config file... "
time="2024-02-27T19:40:08Z" level=info msg="Successfully copied CNI plugin binary and config file."
time="2024-02-27T19:42:17Z" level=error msg="Failed to wait for IPAM daemon to complete" error="signal: killed"

> kc delete pod aws-node-v2ggb
pod "aws-node-v2ggb" deleted

> kc top pod -l k8s-app=aws-node
NAME             CPU(cores)   MEMORY(bytes)   
aws-node-2tj8f   4m           64Mi            
aws-node-n42kk   1203m        48Mi            
aws-node-nvqg7   1m           15Mi

This leads me to conclude this issue is a bit... intermittent? I'm more inclined to believe that some pods were moved to different nodes, which reduced the number of IPs needed and resolved the condition above.

We use the default EKS AMI and a bog-standard installation of the VPC CNI too; our only configuration is related to resources:

{"resources":{"limits":{"memory":"128Mi"},"requests":{"cpu":"5m","memory":"50Mi"}}}

We'll attempt the downgrade and report back further.

@aballman ah good catch, so I am not sure if https://github.com/bottlerocket-os/bottlerocket/issues is the right repo to report this issue or not. Would you mind filing an AWS support case? That will definitely get it routed to the right team, albeit a bit more slowly than the convenience of GitHub Issues.

@Angelin01 which instance type are you seeing this on? And are you using Bottlerocket as well?

As a side note, #2810 should merge soon and be released in v1.16.4, which is targeting mid-March.

@jdn5126
We use SpotInst Ocean as a means of using Spot Instances, but underneath it is a standard EC2 instance with the standard EKS AMI (ami-0d00c9f7217e4079c). Our instances are a mix of large types:

image

We do not use bottlerocket.

I also talked to another team that just created a test EKS cluster today on 1.28, not using SpotInst Ocean and instead using AWS node groups: same issue.

@Angelin01 Got it, thanks for confirming. Seems there are more edge cases than just max pods, so we'll have to push to get v1.16.4 out sooner.

For reference, these nodes have the following number of pods, respectively:

  • 54
  • 55
  • 39

Two of these are VERY close to the limit here, so it may indeed be related to max pods.

But yes, I'd still verify other cases; as I said, a brand-new cluster experienced the same issue. I don't know if you have a "yank" mechanism, but I'd pull the v1.16.3 release.

Confirmed downgrade to v1.16.2 fixes the issue:

> kc get pods -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-5szp6   2/2     Running   0          7m54s
aws-node-d5cbs   2/2     Running   0          7m19s
aws-node-vlqw5   2/2     Running   0          7m45s
> kc top pod -l k8s-app=aws-node
NAME             CPU(cores)   MEMORY(bytes)   
aws-node-5szp6   5m           59Mi            
aws-node-d5cbs   4m           57Mi            
aws-node-vlqw5   4m           43Mi 

image

@jdn5126 I appreciate the fast responses, I hope the fix doesn't prove to be too problematic. I also apologize for the comment spam.

@Angelin01 happy to help, this is an unfortunate regression that we need to make sure we have coverage for in the future. We do not have a "yank" mechanism, just documentation to recommend against updating or updating with caution. And then hopefully a quick turnaround in releasing a patched version.

@aballman I spoke briefly to the EKS Nodes team, and it looks like the incorrect max pods is a Bottlerocket issue, since they set KubeletConfiguration. So we can file an issue at https://github.com/bottlerocket-os/bottlerocket/issues. When you spin up t3.medium with the EKS optimized AMI, max pods is set to 17 as expected.

@jdn5126 Thanks, the follow up is much appreciated. I'll file an issue there.

We have just bumped into this issue. We run a fleet of 100+ EKS clusters and we upgraded the CNI plugin to the latest version in about a dozen of them. Initially we didn't see any issue, but after a couple of days we noticed that our logging ingestion (and cost) had gone through the roof (we have centralized logging for /var/log/aws-routed-eni/ipamd.log) and the aws-node pod in these clusters was using WAY too much CPU.

We released a hot-patch to pin the AWS CNI plugin to v1.16.2 across the whole fleet to prevent accidental upgrades of the plugin. But we prefer not to pin the versions of the AWS-provided add-ons, as they end up not being updated and rot over time.

We run the AWS provided EKS AMI, not bottlerocket.

We do not have a "yank" mechanism, just documentation to recommend against updating or updating with caution. And then hopefully a quick turnaround in releasing a patched version.

For those, like us, using the AWS EKS Add-ons to manage these sorts of components, this feels like a sub-par solution. I think operators who hand over the responsibility of managing these components to AWS expect a quick fix process for identified issues like this one.
If a yank mechanism is not available, an alternative would be a new release with a patch for the bug. Could that be a possibility?

As a side note, #2810 should merge soon and be released in v1.16.4, which is targeting mid-March.

Mid-March can be about two weeks away - there's still potential for v1.16.3 to bring some headaches to EKS operators.

@PerGon Yep, we fully understand and are working on a patch release as soon as possible, so it will be before March 15th. Pulling a release from the field is a bad idea for many reasons, so our primary mechanisms for alerting are EKS console notices and release notes: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.3

Understandable. Hence the suggestion:

an alternative would be a new release with a patch for the bug

Releasing something like a v1.16.3.1 or a v1.16.4 that would only include the patch. The point being: do a release so that v1.16.3 stops being the latest. I would guess the latest version has a much higher probability of being used than the second-to-last.

But I understand a lot of other things could be in the way of doing that; I was just thinking out loud, trying to help reduce the chance of others facing the same issue by upgrading to the latest version.
But I've seen the PR with the fix has been merged, fingers crossed a new release will come out soon 🤞

But I've seen the PR with the fix has been merged, fingers crossed a new release will come out soon 🤞

Yes, we are targeting a quick release.

Can confirm rolling back to v1.16.2-eksbuild.1 solved the issue for us.
We were seeing CPU Usage of around ~1200 milliCPU on v1.16.3-eksbuild.1 compared to ~3 milliCPU on v1.16.2-eksbuild.1.

Screenshot 2024-03-03 at 19 08 55

For those using terraform-eks-module, the following configuration will pin the VPC-CNI EKS add-on to v1.16.2-eksbuild.1

cluster_addons = {
  vpc-cni = {
    # https://github.com/aws/amazon-vpc-cni-k8s/issues/2807
    addon_version = "v1.16.2-eksbuild.1"
    # most_recent = true
  }
}

We have this problem on 1.29 with m5.large instances (ami-0a5010afd9acfaa26). Downgrading to "v1.16.2-eksbuild.1" helped instantly.
Screen Shot 2024-03-05 at 7 59 02 PM

VPC CNI v1.16.4, which contains the fix for this issue, is now released: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.16.4

The EKS Managed Addon should be available in all regions within the next 48 hours.

I am going to close this issue now that v1.16.4 is available, but please feel free to add any comments.

I am going to close this issue now that v1.16.4 is available, but please feel free to add any comments.

We are not seeing v1.16.4 available in the EKS Add-on versions

image

Do you know when this is expected to ship to the EKS Add-on?

@jeremyruffell it takes ~24-48 hours after the GitHub release for the EKS Managed Addon to be available in all regions, with us-east-1 being the last region.