aws/amazon-vpc-cni-k8s

Tolerations change causes cni not configured/initialized errors

aburan28 opened this issue · 6 comments

During a recent upgrade of the vpc cni in our EKS environment from 1.12.6 to 1.15.0 we encountered cni not configured/initialized errors. Upon further review we noticed that our upgrade missed this change #2345

The aws-vpc-cni chart had a tolerations parameter in the values.yaml that was set to tolerations: []. On the actual daemonset it hard coded the tolerations to ensure it ran on all nodes here.

Then in 1.13 the values changed the values.yaml for tolerations to be

tolerations: 
- operator: Exists

and configured that value to be interpolated on the daemonset.
So in the case where developers are using a values.yaml file from the <1.12.6 helm charts where tolerations is set to [] but are now installing the chart v1.13.0+ they will encounter the cni not initialized issue as the aws-node pods will not run on all the nodes.

@aburan28 when upgrading the VPC CNI, you need to deploy the entire chart or manifest. values.yaml files from v1.12.6 are not guaranteed to be compatible with v1.15.0.

Understand that and this was a miss on my end admittedly. However I'm raising it here because this is 100% going to cause issues for other folks and it will not be apparent what is causing this to happen.

Understand that and this was a miss on my end admittedly. However I'm raising it here because this is 100% going to cause issues for other folks and it will not be apparent what is causing this to happen.

I see, but I am not sure what we could possibly do here to help prevent this, as we already call this out in the release notes for each version, and we try to minimize chart changes as much as possible. Helm just sees raw YAML files, so it can't really help us either. I think the best step is to always use the helm chart published upstream, i.e.:

helm repo add eks https://aws.github.io/eks-charts
helm install aws-vpc-cni --namespace kube-system eks/aws-vpc-cni

A potential improvement here could be to check if the tolerations are set to blank in the values.yaml file and if that is the case render the old default of

tolerations:
- operator: Exists

Sorry @aburan28, but that's not really the model for helm, so I don't think we want to make a change here. The goal is to have defaults specified in values.yaml, and then to allow for those defaults to be overridden and rendered correctly via the template. Having the default in two locations would make this more confusing, and it would override someone that wants the tolerations to be empty for whatever reason.

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.