aws/amazon-vpc-cni-k8s

aws-node-* pods in CrashLoopBackOff with "Failed to wait for IPAM daemon to complete" in their logs

chase-replicated opened this issue

What happened:
I created the managed EKS VPC CNI addon after removing the weave-net CNI I had previously installed. Even after creating all new nodes, every aws-node pod created by the VPC CNI DaemonSet crashloops shortly after starting:

Installed /host/opt/cni/bin/aws-cni
Installed /host/opt/cni/bin/egress-v4-cni
time="2023-06-14T22:57:38Z" level=info msg="Starting IPAM daemon... "
time="2023-06-14T22:57:38Z" level=info msg="Checking for IPAM connectivity... "
time="2023-06-14T22:57:39Z" level=info msg="Copying config file... "
time="2023-06-14T22:57:39Z" level=info msg="Successfully copied CNI plugin binary and config file."
time="2023-06-14T22:57:39Z" level=error msg="Failed to wait for IPAM daemon to complete" error="exit status 1"

Attach logs

I can't attach them directly: my nodes don't have public IP addresses, none of my running pods have SSH installed in their containers, and I can't create a new pod because the CNI isn't working, so it would never come up.
Update: I was able to get in via a bastion host in EC2 and sent an email with the logs.

What you expected to happen:
Pods run

How to reproduce it (as minimally and precisely as possible):
1. Install the weave-net CNI on an EKS v1.23 cluster
2. Delete the weave-net CNI
3. Recreate all nodes
4. Install AWS VPC CNI v1.12.6 (see the CLI sketch after this list)
5. The aws-node pods start crashlooping
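
For reference, a minimal sketch of installing the managed addon via the AWS CLI; the cluster name, exact addon version string, and role ARN are placeholders, not values from this issue:

# Hypothetical cluster name, version string, and role ARN; substitute your own.
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.12.6-eksbuild.2 \
  --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKSVPCCNIRole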

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Client Version: v1.26.1
    Kustomize Version: v4.5.7
    Server Version: v1.23.17-eks-0a21954
  • CNI Version 1.12.6
  • OS (e.g: cat /etc/os-release): can't get to nodes
  • Kernel (e.g. uname -a): can't get to nodes

@chase-replicated I do not know anything about weave-net CNI, but it sounds like that plugin may be leaving some artifacts or changing something in the file system that VPC CNI cannot handle. Are you able to SSM from your AWS console to the node? The important logs here would be in /var/log/aws-routed-eni/.
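
From the CLI, an SSM session looks roughly like this (a sketch; the instance ID is a placeholder, and it assumes the SSM agent is running on the node with an instance role that allows Session Manager):

# Requires the Session Manager plugin for the AWS CLI; instance ID is a placeholder.
aws ssm start-session --target i-0123456789abcdef0
# Once on the node, the relevant logs are under:
sudo ls /var/log/aws-routed-eni/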

I was able to SSH to the node via a bastion host (I've corrected my original comment) and sent the logs to the specified email.

Sorry, I meant to say I sent the output of the script mentioned in the ticket instructions to the specified email. I'm not sure which of these logs I should be looking at, as there are a bunch of them:
egress-v4-plugin.log ipamd.log plugin-2023-06-15T04-49-05.069.log plugin-2023-06-15T04-49-05.069.log.gz plugin-2023-06-15T11-17-07.066.log.gz plugin.log

Turns out I just had an incorrect service account name present in the trust relationship for the IAM role, so after updating that it's working fine.

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Could you elaborate on what the correct trust relationship for the IAM role is? I'm having the same issue with v1.12 and the newly released v1.13, but it only happens on the latest Kubernetes 1.27; on 1.26 the addon works fine.

I'm on 1.23, so it's unlikely to be the same problem. I had to set my trust relationship to target the name of the service account I had created via Terraform; I had previously set it to an arbitrary, non-existent service account name.
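
For anyone who hits this later, here is a minimal sketch of the IRSA trust policy the VPC CNI role typically needs; the account ID, region, OIDC provider ID, and role name below are all placeholders. The key detail is that the ":sub" condition must name a service account that actually exists; for the managed addon that is kube-system/aws-node:

# Sketch only: account ID, region, OIDC ID, and role name are placeholders.
cat <<'EOF' > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE:sub": "system:serviceaccount:kube-system:aws-node"
        }
      }
    }
  ]
}
EOF
# The "sub" condition must match system:serviceaccount:<namespace>:<name>.
aws iam update-assume-role-policy --role-name AmazonEKSVPCCNIRole --policy-document file://trust-policy.json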