aws/amazon-vpc-cni-k8s

[EKS] Pods stuck in ContainerCreating status after upgrading to Kubernetes version 1.30

Gier32o opened this issue · 7 comments

Pods are stuck in ContainerCreating status after upgrading to Kubernetes version 1.30 on EKS.
We have the 'Security Groups for Pods' feature turned on, and when we try to upgrade from:

ami_id             = "ami-066d744867bb80fce"
vpc_cni_version    = "v1.16.2-eksbuild.1"
kubernetes_version = "1.29"

to

ami_id             = "ami-05e7e986227a095a9"
vpc_cni_version    = "v1.18.2-eksbuild.1"
kubernetes_version = "1.30"

pods fail with the following events:

  Warning  FailedCreatePodSandBox  19m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cffd4f13c293011d5f6e967bd5859c234ab1f83731fbf1e40c46330e6276fdd7": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  66s (x85 over 19m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "38c6783c31d39443b9b0fe4873868fdf972c92d499176b5b44c9df42b4461865": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

There is no issue when the 'Security Groups for Pods' feature is turned off.
How to reproduce: https://github.com/Gier32o/k8s-upgrade-problem

Hello @Gier32o, do the /var/log/aws-routed-eni/plugin.log or /var/log/aws-routed-eni/ipamd.log logs show any details about the IP assignment failure? Is the aws-node pod running?
Usually the CNI version does not change during a K8s upgrade; we keep the CNI version the same while performing the K8s upgrade and upgrade the CNI afterwards. Does this workflow give the desired outcome?
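For reference, a quick way to check both things (a sketch; assumes kubectl access to the cluster and shell access to the affected worker node, e.g. via SSM):

```shell
# Confirm the aws-node (VPC CNI) daemonset pods are Running on every node
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# On the affected worker node, inspect the CNI plugin and IPAM daemon logs
# mentioned above for IP-assignment errors
sudo tail -n 100 /var/log/aws-routed-eni/plugin.log
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log
```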

Hi, the aws-node pods are running fine. You were right: upgrading the addon version before, or at the same time as, Kubernetes and the worker AMIs results in this error. If I run the upgrade in two batches, 1. (K8s + AMIs) then 2. (Addon), it works fine. Thanks!
Is there any way to fix such a broken cluster afterwards?

Is there any way to fix such a broken cluster afterwards?

I am not sure what could have led to this state, but you can downgrade the addon to the previous version, restart the pods, and then upgrade the addon again.
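A sketch of that recovery sequence with the AWS CLI and kubectl (the cluster name, addon version, and pod names are placeholders, not from this thread):

```shell
# Roll the vpc-cni addon back to the previously working version
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.16.2-eksbuild.1 \
  --resolve-conflicts OVERWRITE

# Restart the CNI daemonset, then recreate the stuck workload pods
kubectl rollout restart daemonset aws-node -n kube-system
kubectl delete pod <stuck-pod> -n <namespace>
```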

Hi @orsenthil, I upgraded in the order you recommended, (K8s + AMIs) first, then the addon, but got the same problem. It even fails randomly, not every time.

vpc_cni_version    = "v1.16.3-eksbuild.2"
kubernetes_version = "1.28"

to

vpc_cni_version    = "v1.18.2-eksbuild."
kubernetes_version = "1.29"

Error: updating EKS Add-On (test:vpc-cni): operation error EKS: UpdateAddon, https response error StatusCode: 400, RequestID: 608f24fe-795a-4c7c-acba-8d11836aa01b, InvalidParameterException: Addon version specified is not supported
when trying to downgrade the plugin from 1.18.2 to 1.16.3.
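The "Addon version specified is not supported" error usually means the target addon version is not published for that Kubernetes version. You can list the versions EKS actually accepts before downgrading (a sketch; adjust the Kubernetes version to match your cluster):

```shell
# List the vpc-cni addon versions supported on Kubernetes 1.29
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.29 \
  --query 'addons[].addonVersions[].addonVersion'
```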

Nothing changed when I downgraded to 1.17.1.