aws/amazon-vpc-cni-k8s

[aws-vpc-cni] EKS-preinstalled VPC CNI switched to being deployed by Helm without a way to determine that the deployment is AWS-managed

AndiDog opened this issue · 8 comments

Describe the bug

I'm describing EKS behavior here. I didn't know where else to put this to bring it to the attention of AWS EKS engineers; please move the issue if desired.

The AWS VPC CNI is preinstalled by default on a newly created EKS cluster, and currently there's no way to make its installation optional in order to install only a custom CNI (e.g. Cilium, Calico, ...). Now it seems that EKS has changed how it deploys the CNI, from a kubectl apply style to a Helm-style installation (including the "managed by" label). This is a problem for Kubernetes Cluster API Provider AWS (issue) because that operator wants to delete the AWS VPC CNI if a custom CNI is desired. It therefore has to determine whether the preinstalled AWS VPC CNI resources are AWS-managed, and if so, delete them. Since the preinstalled resources are now Helm-managed, there's no way to tell whether the chart was deployed automatically by AWS or by a cluster admin.

One solution could be to add a label that states "this Helm chart is managed / pre-installed by AWS".
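
For illustration, a minimal sketch of the kind of check an operator could then run; the label key/value used here (eks.amazonaws.com/managed-by=eks) is purely hypothetical and is not something EKS sets today:

# Hypothetical marker label (not set by EKS today): only treat aws-node as
# safe to delete when it carries the AWS "preinstalled" marker.
if kubectl -n kube-system get ds -l 'eks.amazonaws.com/managed-by=eks' -o name \
    | grep -q '^daemonset.apps/aws-node$'; then
  echo "aws-node was preinstalled by EKS; the operator may delete it"
fi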

This breaking change in behavior even applies to older EKS Kubernetes versions such as 1.24, and therefore led to disruption of the described feature.

Steps to reproduce

  • Use Cluster API Provider AWS (CAPA) to create a new EKS cluster with default settings. It should also be the same without that operator, e.g. via the AWS Console, but I didn't test that.
  • Log into the new cluster and run kubectl get ds -n kube-system aws-node -o yaml. The "managed by" Helm label is there.
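
A quicker check is to print that label as its own column (kubectl's -L/--label-columns flag); on affected clusters it shows "Helm" even though no cluster admin ever ran Helm:

# Show the Helm "managed by" label as an extra column
kubectl -n kube-system get ds aws-node -L app.kubernetes.io/managed-by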

Environment

  • Chart name: aws-vpc-cni
  • Chart version: As pre-installed for a new EKS cluster
  • Kubernetes version: 1.24, likely also all others
  • Using EKS (yes/no), if so version? – yes

Moving to AWS VPC CNI repo

@AndiDog the default addon versions that are installed on cluster creation use Kubernetes server-side apply, i.e. something like:

helm template aws-vpc-cni --include-crds --namespace kube-system eks/aws-vpc-cni --set originalMatchLabels=true | kubectl apply --server-side --force-conflicts --field-manager eks -f -

It sounds like you are asking for the EKS API to provide the ability to create the cluster without any addons being installed, right?

Having the API option to not pre-install the VPC CNI would be the best, non-hacky solution indeed.

We found that the only way to fully take control of the VPC CNI deployment via Helm is to delete all of the resources (including the CRDs) before the Helm deployment. Otherwise some fields remain owned by the "eks" field manager.

This is more of a problem when your Helm deployment is a subset of what the default vpc-cni deploys:

  1. The default vpc-cni is deployed with the network policy agent sidecar.
  2. We added the Helm labels and annotations (as the docs suggest) to take Helm ownership.
  3. We "upgraded" the default vpc-cni with Helm to 1.16.0, but with "enabled: false" for the agent sidecar.
  4. The sidecar remains, because those fields are still owned by the "eks" field manager (this can be confirmed by inspecting managed fields, see the snippet below).
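
For reference, a quick way to list which field managers own parts of the DaemonSet (server-side apply records this in metadata.managedFields, which kubectl hides unless asked):

# Managed fields are hidden by default since kubectl 1.21; this lists the
# field managers (e.g. "eks", "Helm") recorded on the aws-node DaemonSet.
kubectl -n kube-system get ds aws-node -o yaml --show-managed-fields | grep 'manager:'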

Literally the only way to guarantee ownership is to delete everything before Helm touches the deployment.
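
A rough sketch of that delete-then-reinstall approach, assuming the resource names created by the default install (double-check the CRD names on your cluster, they can differ between chart versions):

# WARNING: pod networking is broken from the delete until the chart is reinstalled.
kubectl -n kube-system delete daemonset aws-node
kubectl -n kube-system delete configmap amazon-vpc-cni
kubectl -n kube-system delete serviceaccount aws-node
kubectl delete clusterrolebinding aws-node
kubectl delete clusterrole aws-node
# ENIConfig CRD shipped with the chart; list CRDs first to confirm the exact names.
kubectl delete crd eniconfigs.crd.k8s.amazonaws.com

# Assumes the eks Helm repo (eks/aws-vpc-cni) is already added.
helm install aws-vpc-cni eks/aws-vpc-cni --namespace kube-system --set originalMatchLabels=true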

I would also welcome the option to choose which add-ons get deployed by default (even though the preinstalled VPC CNI isn't really a managed add-on).

@AndiDog that support is tracked by aws/containers-roadmap#923. There is a ton of engagement and upvotes on this issue, so I assume the work is in progress, but I am not certain. It would help them prioritize if you upvote this issue as well.

@TechnoTaff typically field management is not a problem, but I tried the following script and it seems to adopt all resources into the Helm release and remove the network policy agent:

#!/usr/bin/env bash

set -euo pipefail

for kind in daemonSet clusterRole clusterRoleBinding serviceAccount; do
  echo "setting annotations and labels on $kind/aws-node"
  kubectl -n kube-system annotate --overwrite $kind aws-node meta.helm.sh/release-name=aws-vpc-cni
  kubectl -n kube-system annotate --overwrite $kind aws-node meta.helm.sh/release-namespace=kube-system
  kubectl -n kube-system label --overwrite $kind aws-node app.kubernetes.io/managed-by=Helm
done

kubectl -n kube-system annotate --overwrite configmap amazon-vpc-cni meta.helm.sh/release-name=aws-vpc-cni
kubectl -n kube-system annotate --overwrite configmap amazon-vpc-cni meta.helm.sh/release-namespace=kube-system
kubectl -n kube-system label --overwrite configmap amazon-vpc-cni app.kubernetes.io/managed-by=Helm

helm template aws-vpc-cni --include-crds --namespace kube-system eks/aws-vpc-cni --set originalMatchLabels=true,nodeAgent.enabled=false | kubectl apply --server-side --force-conflicts --field-manager Helm -f -

Edit: hmm... I actually had to run the script twice to get it to properly remove the network policy agent. This seems to be a quirk of how kubectl server-side apply handles field ownership.
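
To double-check that the adoption worked, the container list of the DaemonSet can be inspected afterwards; the network policy agent container should no longer be present (container names vary by chart version):

# Print the containers in the aws-node pod template
kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.containers[*].name}'; echo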

Closing this, as the true fix (outside of the script to adopt the resources into Helm) will be tracked by aws/containers-roadmap#923

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.