aws/amazon-vpc-cni-k8s

[EKS] Pods stuck in ContainerCreating status after upgrading to Kubernetes version 1.30

Gier32o opened this issue · 7 comments

Pods are stuck in ContainerCreating status after upgrading to Kubernetes version 1.30 on EKS.
We have the 'Security Groups for Pods' feature turned on, and when we try to upgrade from:

ami_id             = "ami-066d744867bb80fce"
vpc_cni_version    = "v1.16.2-eksbuild.1"
kubernetes_version = "1.29"

to

ami_id             = "ami-05e7e986227a095a9"
vpc_cni_version    = "v1.18.2-eksbuild.1"
kubernetes_version = "1.30"

pods fail with the following events:

  Warning  FailedCreatePodSandBox  19m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cffd4f13c293011d5f6e967bd5859c234ab1f83731fbf1e40c46330e6276fdd7": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  66s (x85 over 19m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "38c6783c31d39443b9b0fe4873868fdf972c92d499176b5b44c9df42b4461865": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

There is no issue when the 'Security Groups for Pods' feature is turned off.
How to reproduce: https://github.com/Gier32o/k8s-upgrade-problem

Hello @Gier32o, do the /var/log/aws-routed-eni/plugin.log or /var/log/aws-routed-eni/ipamd.log logs show any details about the IP assignment failure? Is the aws-node pod running?
Usually the CNI version does not change during a K8s upgrade; we keep the CNI version the same while performing the K8s upgrade and upgrade the CNI afterwards. Does this workflow give the desired outcome?
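For reference, a quick way to check both things (a sketch; assumes kubectl access to the cluster and shell access to the affected worker node, e.g. via SSM):

```shell
# Confirm the aws-node (VPC CNI) daemonset pods are Running on every node
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# On the affected worker node, inspect the CNI plugin and IPAM daemon logs
# mentioned above for IP-assignment errors
sudo tail -n 100 /var/log/aws-routed-eni/plugin.log
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log
```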

Hi, the aws-node pods are running fine. You were right: upgrading the addon version before, or at the same time as, Kubernetes and the worker AMIs results in this error. If I run the upgrade in two batches, 1. (K8s + AMIs) then 2. (Addon), it works fine. Thanks!
Is there any way to fix such a broken cluster afterwards?

Is there any way to fix such a broken cluster afterwards?

I am not sure what could have led to this state, but you can downgrade the addon to the previous version, restart the pods, and then upgrade the addon again.
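A sketch of that recovery sequence with the AWS CLI and kubectl (the cluster name, addon version, and pod names are placeholders, not from this thread):

```shell
# Roll the vpc-cni addon back to the previously working version
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.16.2-eksbuild.1 \
  --resolve-conflicts OVERWRITE

# Restart the CNI daemonset, then recreate the stuck workload pods
kubectl rollout restart daemonset aws-node -n kube-system
kubectl delete pod <stuck-pod> -n <namespace>
```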

Hi @orsenthil, I upgraded in the order you recommended, (K8s + AMIs) first, then the addon, but got the same problem. It even fails randomly, not every time.

vpc_cni_version    = "v1.16.3-eksbuild.2"
kubernetes_version = "1.28"

to

vpc_cni_version    = "v1.18.2-eksbuild."
kubernetes_version = "1.29"

Error: updating EKS Add-On (test:vpc-cni): operation error EKS: UpdateAddon, https response error StatusCode: 400, RequestID: 608f24fe-795a-4c7c-acba-8d11836aa01b, InvalidParameterException: Addon version specified is not supported
when trying to downgrade the plugin from 1.18.2 to 1.16.3.
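The "Addon version specified is not supported" error usually means the target addon version is not published for that Kubernetes version. You can list the versions EKS actually accepts before downgrading (a sketch; adjust the Kubernetes version to match your cluster):

```shell
# List the vpc-cni addon versions supported on Kubernetes 1.29
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.29 \
  --query 'addons[].addonVersions[].addonVersion'
```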

Nothing changed when I downgraded to 1.17.1.