aws/amazon-vpc-cni-k8s

Upgrading from v1.16.0-eksbuild.1 to v1.17 or v1.18 results in failure to assign IP address to container

jdinsel-xealth opened this issue · 9 comments

What happened:

Upgrading from v1.16.0 to v1.17 or higher results in scheduled pods that cannot obtain an IP address. Downgrading back to v1.16.0 restores functionality. While this condition persists, the EKS cluster also does not scale out.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3a820c12790041f3e7e75e6a969a6c3e9fad7f9398fcc10b349ee17690e53b89": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Attach logs

What you expected to happen:

IP addresses should be assigned to pods, or the cluster should scale out and then be able to assign IP addresses to pods.

How to reproduce it (as minimally and precisely as possible):

On an EKS cluster running EKS 1.28, upgrade the VPC CNI add-on from v1.16 to v1.17 or v1.18. It may be necessary to add additional pods, but at some point a pod will be assigned to an existing node and will sit in a Pending state because aws-cni cannot assign an IP address to the container.
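For illustration, the upgrade step looked roughly like the following (cluster name, region, and the exact eksbuild suffix are placeholders, not our real values):

aws eks update-addon --cluster-name my-cluster --region us-west-2 \
    --addon-name vpc-cni --addon-version v1.17.1-eksbuild.1

# watch for pods that never leave Pending/ContainerCreating afterwards
kubectl get pods -A --field-selector=status.phase=Pending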

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.28
  • CNI Version: v1.17.1 or v1.18.0
  • OS (e.g: cat /etc/os-release): Bottlerocket
  • Kernel (e.g. uname -a): bottlerocket-aws-k8s-1.28-x86_64-v1.19.2-29cc92cc

Do you see any errors in the ipamd.log that stand out?

Can you run the node log collector (/opt/cni/bin/aws-cni-support.sh, or the script from https://github.com/awslabs/amazon-eks-ami/tree/master/log-collector-script/linux) against the nodes and share the logs with us?
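For example, something along these lines on an affected node (the exact script name under the linked directory should be double-checked):

sudo bash /opt/cni/bin/aws-cni-support.sh

# or download and run the standalone EKS log collector
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh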

I'm having trouble gathering the logs. I walked through some steps while we were on v1.16.0 to get a feel for what was available. I found that I could access the ipamd.log in the aws-node pod when v1.16.0 was running, but could not get a shell when v1.18.0 was running. We're also using Bottlerocket as the AMI, and the steps in the linked collector produced more missing-command errors than useful output. I was connected to the node where the IP could not be allocated and was unable to find information with the script or manually. Do you have any guidance on what I could do differently?

I tried to reproduce this issue using a new cluster and Bottlerocket image, but I could not.

  1. Set up a 1.29 cluster with Bottlerocket using https://github.com/eksctl-io/eksctl/blob/main/examples/20-bottlerocket.yaml
  2. Tested with CNI 1.16 - scaled up pods; works.
  3. Updated CNI to 1.17 - scaled up pods, created new pods; works.
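For reference, those steps correspond roughly to commands like these (cluster name, test deployment, and version strings are illustrative):

eksctl create cluster -f 20-bottlerocket.yaml

aws eks update-addon --cluster-name bottlerocket --addon-name vpc-cni \
    --addon-version v1.17.1-eksbuild.1

kubectl scale deployment nginx --replicas=30
kubectl get pods -o wide    # every pod should have an IP and a node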

To look into your Bottlerocket logs:

Log in to your instance, run sudo sheltie, and check the logs under /var/log/aws-routed-eni/. You could also SSM into your instance and follow the prompts.

To permit more intrusive troubleshooting, including actions that mutate the
running state of the Bottlerocket host, we provide a tool called "sheltie"
(`sudo sheltie`).  When run, this tool drops you into a root shell in the
Bottlerocket host's root filesystem.
[ec2-user@admin]$ sudo sheltie
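Once in that root shell, something like the following should surface the relevant entries (the CNI's standard log locations):

ls /var/log/aws-routed-eni/
tail -n 100 /var/log/aws-routed-eni/ipamd.log
grep -i "assign" /var/log/aws-routed-eni/plugin.log | tail -n 20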

This failure (add cmd: failed to assign an IP address to container) can happen if sufficient IP addresses are not available on your instance. Ensure that you have sufficient IPs. The ipamd.log can also provide information about available and assigned addresses.
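A quick way to sanity-check capacity is to compare the instance type's ENI/IP limits with what ipamd reports (the instance type below is only an example):

aws ec2 describe-instance-types --instance-types m5.large \
    --query "InstanceTypes[].{ENIs:NetworkInfo.MaximumNetworkInterfaces,IPsPerENI:NetworkInfo.Ipv4AddressesPerInterface}"

grep "IP pool" /var/log/aws-routed-eni/ipamd.log | tail -n 20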

Thanks, @orsenthil, for your guidance. I have reproduced the issue and submitted the ipamd.log to the triage email. There are messages logged that the ENI on the node "does not have available addresses" and "IP address pool stats: total 18, assigned 18", as well as "IP pool is too low: available (0) < ENI target (1) * addrsPerENI (9)" ... yet the cluster did not scale out to create another node.
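Reading those numbers back (my interpretation): available = total - assigned = 18 - 18 = 0, and the warm-pool check wants at least WARM_ENI_TARGET (1) * addrsPerENI (9) = 9 free addresses, so ipamd should be trying to attach another ENI; if the instance is already at its ENI/IP limit, new pods sit Pending until another node joins.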

Do you mean an additional ENI wasn't created, or that the cluster didn't scale out with another node? If the latter, that is auto-scaling functionality, not networking.
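To separate the two, checks along these lines can help (the autoscaler's deployment name and namespace depend on how it was installed):

kubectl describe pod <pending-pod> | tail -n 20
kubectl get events -A --field-selector reason=FailedScheduling
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=50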

All that has changed is the VPC CNI driver. On v1.16, the cluster scales out when nodes reach their limits; after the upgrade it does not, and we see the error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3a820c12790041f3e7e75e6a969a6c3e9fad7f9398fcc10b349ee17690e53b89": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

I think I will close this, as I see no evidence that the VPC CNI is not working as expected. There was an inability to assign an IP, but I believe that is because the nodes were oversubscribed and the cluster's autoscaler did not add a new node. I found a discrepancy between the number of EC2 instances in the node group and the number of nodes seen in Kubernetes.
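For reference, roughly how I compared the two counts (the node-group tag value is specific to our setup):

aws ec2 describe-instances \
    --filters "Name=tag:eks:nodegroup-name,Values=my-nodegroup" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].InstanceId" --output text | wc -w

kubectl get nodes --no-headers | wc -l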

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.