aws/amazon-vpc-cni-k8s

Pulumi issue: aws-node pods are crashlooping

ba1ajinaidu opened this issue · 23 comments

What happened:
Pods from the aws-node daemonset are crashlooping on all the nodes; the aws-node cluster role is missing a few rules related to policyendpoints. Adding the rules fixes the crashlooping. Is there any way to upgrade the CNI?
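For reference, a minimal sketch of the workaround I applied (assuming the missing rules are the ones in the upstream v1.15.1 manifest; the ClusterRole name and resources below mirror that manifest):

```sh
# Workaround sketch: append a read rule for PolicyEndpoint resources to the
# existing aws-node ClusterRole. The upstream v1.15.1 manifest also grants
# "get" on policyendpoints/status, which may be needed as well.
kubectl patch clusterrole aws-node --type='json' -p='[
  {
    "op": "add",
    "path": "/rules/-",
    "value": {
      "apiGroups": ["networking.k8s.aws"],
      "resources": ["policyendpoints"],
      "verbs": ["get", "list", "watch"]
    }
  }
]'
```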

Attach logs

aws-node-zsv2p aws-eks-nodeagent W0215 09:17:02.372137       1 reflector.go:533] pkg/mod/k8s.io/client-go@v0.27.3/tools/cache/reflector.go:231: failed to list *v1alpha1.PolicyEndpoint: policyendpoints.networking.k8s.aws is forbidden: User "system:serviceaccount:kube-system:aws-node" cannot list resource "policyendpoints" in API group "networking.k8s.aws" at the cluster scope

What you expected to happen:
Pods shouldn't crashloop, and the cluster role should have the necessary rules.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.28.5-eks-5e0fdde
  • CNI Version: 1.15.1
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):

@ba1ajinaidu how did you deploy v1.15.1? That chart has the proper cluster roles: https://github.com/aws/amazon-vpc-cni-k8s/blob/v1.15.1/config/master/aws-k8s-cni.yaml#L318

I'm not installing it explicitly, it's installed by default in the cluster


So you created a new EKS 1.28 cluster and immediately saw this issue when you enabled Network Policy?

Created the cluster a few days ago, but noticed the crashloops recently

@ba1ajinaidu - What is your EKS platform version? Also, how did you create the EKS cluster? I just created a 1.28 cluster and no crash is seen with or without NP enabled.

@jayanthvn created the cluster using the Pulumi AWS provider, and the EKS platform version is eks.7

@ba1ajinaidu is it possible you are using some 3rd party security tool that manages CRDs and cluster roles? This role is present in the chart that gets installed, so it seems like the only way it could be missing is if some 3rd-party tool is removing it.

I haven't installed any 3rd-party tools on the cluster; I'll try with a new cluster and see how that goes.
Just curious: is the CNI version fixed based on the k8s version and EKS platform version? Can I upgrade the chart somehow?


The VPC CNI is an addon, so it can be updated at any time. This public doc walks through how to create and manage it with the EKS addon API, which is what we recommend: https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html
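If you want to do it from the CLI, here is one way (a sketch; the cluster name and versions are placeholders, and this assumes you are managing the CNI as an EKS managed addon):

```sh
# List the addon versions available for your cluster's Kubernetes version,
# then update the managed vpc-cni addon to the one you want.
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.28
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.15.1-eksbuild.1 \
  --resolve-conflicts OVERWRITE
```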


I'm also seeing this issue with a new cluster installed with Pulumi. Were you able to resolve?

@mjdouble @ba1ajinaidu Can you try creating a cluster with eksctl (or some other provider) to see if you run into the same issue? I cannot reproduce this, which makes me believe that this is a bug in the Pulumi workflow. Have you tried asking there?
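For example (a sketch; the name, region, and version here are placeholders):

```sh
# Create a throwaway 1.28 cluster with eksctl to rule out the provisioning tool.
eksctl create cluster --name cni-repro --region us-west-2 --version 1.28 --nodes 2
```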

Same issue hit here:
EKS 1.26 with addon version v1.15.1-eksbuild.1

@jun0tpyrc using eksctl or Pulumi?

Running into this issue with amazon-k8s-cni:v1.12.5 and Kubernetes version v1.26.12-eks-5e0fdde.
Cluster created with Pulumi.

All evidence points towards this being a Pulumi problem, so I highly suggest that anyone facing this raise an issue there.

@jdn5126 Having the same issue here, and I will open an issue.

Yes, this was a Pulumi issue. FWIW, it seems resolved for me now with the latest Pulumi updates.

If you're still having the issue, check with these versions (upgrade command sketch below):

@pulumi/pulumi 3.107.0
@pulumi/eks 2.2.1
@pulumi/aws 6.23.0
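For example, in a Node.js Pulumi program (a sketch, assuming npm and the package names above):

```sh
# Upgrade the Pulumi packages in place to the versions listed above.
npm install @pulumi/pulumi@3.107.0 @pulumi/eks@2.2.1 @pulumi/aws@6.23.0
```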

@mjdouble and @jdn5126: Can confirm that with @pulumi/aws >= v6 and @pulumi/eks >= v2 the problem no longer appears.

I suggest upgrading (as I did).


Closing this as a Pulumi issue. If there are any changes in the VPC CNI that could help avoid this, let us know.

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

I'm using v1.11.3, and it doesn't require those missing permissions; see the doc.

But I noticed that the new cluster I'm creating with the same IaC gets extra env vars on the aws-node container, i.e. VPC_CNI_VERSION = v1.15.1 and VPC_ID = vpc-xxxxxxxxx.

Not sure how it's getting added. Still a mystery.

I'm not using Pulumi; I'm using the plain old Terraform module.

If it is using v1.15.1, then we need to fix the permissions as per the doc.
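A quick way to check which image (and which env vars) the daemonset is actually running (a sketch; assumes kubectl access to the cluster):

```sh
# Print the image tag and the container env vars of the running aws-node daemonset.
kubectl -n kube-system get daemonset aws-node -o jsonpath='{.spec.template.spec.containers[*].image}'
kubectl -n kube-system get daemonset aws-node -o jsonpath='{.spec.template.spec.containers[*].env}'
```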

Versions:

eks: v1.25.16-eks-77b1e4e