aws/amazon-vpc-cni-k8s

aws-eks-nodeagent in CrashLoopBackOff on addon update

kramuenke-catch opened this issue · 29 comments

Hi all,

We tried to update the vpc-cni addon from v1.13.4-eksbuild.1 to v1.14.0-eksbuild.3 and the update fails with the aws-eks-nodeagent container being in CrashLoopBackOff.

We are running EKS 1.27 and are not setting any configuration for aws-eks-nodeagent.

Running kubectl logs aws-node-98rkh aws-eks-nodeagent -n kube-system prints only

{"level":"info","ts":"2023-09-04T02:21:39Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}

So we aren't sure what causes the container to crash. We pass the following configuration to the vpc-cni addon

  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: "true"
  ENI_CONFIG_LABEL_DEF: topology.kubernetes.io/zone
  ENABLE_PREFIX_DELEGATION: "true"
  WARM_ENI_TARGET: "2"
  WARM_PREFIX_TARGET: "4"
  AWS_VPC_K8S_CNI_EXTERNALSNAT: "true"

Otherwise it's all defaults.

@kramuenke-catch How did you upgrade the addon? Managed addon, via Helm, or via kubectl commands? Did you enable Network Policy support? EKS AMI or custom AMI?

If you enabled Network Policy support, have you looked at the pre-reqs and/or tried following the installation steps documented here - https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html ?

For issues with the EKS Node Agent, you can open a ticket in the Network Policy Agent repo instead: https://github.com/aws/aws-network-policy-agent

Facing the exact same issue during a managed vpc-cni addon upgrade from v1.13.4-eksbuild.1 to v1.14.0-eksbuild.3. Network Policy support is disabled (--enable-network-policy=false, the default).

This is happening during an EKS cluster upgrade from 1.26 -> 1.27. The worker node AMI is still from 1.26. I'm going to update it to the latest 1.27 one and check whether that helps.

bunicb commented

Exact same issue here, although we are using EKS 1.24, the managed vpc-cni addon, and the Ubuntu EKS-optimized AMI, with no changes regarding Network Policy support (as it requires 1.25).

Observed both while upgrading vpc-cni from v1.13.4-eksbuild.1 to v1.14.0-eksbuild.3 and on a completely new EKS cluster with vpc-cni v1.14.0-eksbuild.3. Downgrading the addon brought the pods back to life.

This is happening during an EKS cluster upgrade from 1.26 -> 1.27. The worker node AMI is still from 1.26. I'm going to update it to the latest 1.27 one and check whether that helps.

Tried with the latest 1.27.4-20230825 node AMI; the issue is still there, and there is nothing in the logs of the aws-network-policy-agent, which is in CrashLoopBackOff:

โ”‚ {"level":"info","ts":"2023-09-04T12:42:45Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}

Had to downgrade the addon back to v1.13.4-eksbuild.1.

Facing a similar problem. It works well on a fresh new cluster but not on an existing one. FYI: we have been using the Calico operator so far; we tried removing it, but it still doesn't work.

Can you please share the output of:

kubectl get crd | grep policy
policyendpoints.networking.k8s.aws           2023-08-16T04:49:17Z

I tried the upgrade locally on a 1.27 cluster from 1.13.4-eksbuild.1 -> 1.14.0-eksbuild.3 and the upgrade works fine:

kubectl get pods -n kube-system -owide
NAME                       READY   STATUS    RESTARTS     AGE    IP               NODE                                           NOMINATED NODE   READINESS GATES
aws-node-mzwbr             2/2     Running   0            94s    192.168.16.63    ip-192-168-16-63.us-west-2.compute.internal    <none>           <none>
aws-node-p2hqw             2/2     Running   0            111s   192.168.82.239   ip-192-168-82-239.us-west-2.compute.internal   <none>           <none>


kubectl get crd
NAME                                         CREATED AT
cninodes.vpcresources.k8s.aws                2023-08-09T04:30:24Z
eniconfigs.crd.k8s.amazonaws.com             2023-08-09T04:30:20Z
policyendpoints.networking.k8s.aws           2023-08-16T04:49:17Z
securitygrouppolicies.vpcresources.k8s.aws   2023-08-09T04:30:24Z

Also, if the CRD is installed, can you share the node log bundle? You can email it to us at k8s-awscni-triage@amazon.com.
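For reference, the node log bundle is usually produced with the EKS log collector script. A minimal sketch, assuming the eks-log-collector.sh script from the awslabs/amazon-eks-ami repository has already been copied onto the affected worker node:

# run on the affected worker node; requires root
sudo bash eks-log-collector.sh
# the script prints the path of the tarball it creates; attach that to the email above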

I've tried upgrading the addon from 1.12.x and 1.13.x to 1.14.0 on 1.25/1.26/1.27 clusters (new and existing) and I'm not able to reproduce the above issue. As requested above, please check if the CRD is installed and do share the node logs via the email shared above.

Also, please do share how you upgraded the addon and the output of the amazon-vpc-cni configmap in the kube-system namespace.

Also, can you check whether port 8080 on the host is already in use by another process/container? If yes, that can be the source of the issue you're observing. You can change the default metrics port used by the network policy agent via this flag: https://github.com/aws/aws-network-policy-agent/blob/main/pkg/config/runtime_config.go#L17. We will make this configurable via Helm as well.
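For example, a quick way to check for an existing listener on 8080 (a sketch, assuming shell access to the worker node):

# on the worker node: show any process already bound to port 8080
sudo ss -lntp | grep ':8080'
# on older images without ss, netstat works the same way
sudo netstat -lntp | grep ':8080'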

I also encountered the same issue when upgrading the vpc-cni addon from v1.13.4-eksbuild.1 to v1.14.0-eksbuild.3 on AWS EKS 1.26, with the operating system being Bottlerocket OS 1.14.3. However, the issue does not arise when doing the upgrade on a new EKS 1.26 cluster locally. And port 8080 does not appear to be in use on the host.

Also, please do share how you upgraded the addon and the output of the amazon-vpc-cni configmap in the kube-system namespace.

👋 We are running the vpc-cni addon on EKS 1.27 with EKS-managed Bottlerocket node groups on AMI release version 1.14.3-764e37e4.

Our addon update usually goes through eksctl, but I also tried manually via:

aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.14.0-eksbuild.3

YAML version of the config map:

apiVersion: v1
data:
  enable-network-policy-controller: "false"
  enable-windows-ipam: "false"
kind: ConfigMap
metadata:
  creationTimestamp: "2023-09-04T00:58:57Z"
  labels:
    app.kubernetes.io/instance: aws-vpc-cni
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: aws-node
    app.kubernetes.io/version: v1.14.0
    helm.sh/chart: aws-vpc-cni-1.14.0
    k8s-app: aws-node
  name: amazon-vpc-cni
  namespace: kube-system
  resourceVersion: "46392531"
  uid: 74f00062-cc21-43d6-8641-1eacc1c495aa

We have not installed any CRDs, assuming this is managed by the addon. I would not expect any CRDs to be required as long as we do not enable the network policy controller.

We are also facing the same issue after upgrading the VPC CNI from v1.13 to v1.14 on EKS 1.24.

If we configure the VPC-CNI add-on to use the service account role AmazonEKSVPCCNIRole:

We use this command to enable network policy:
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.14.0-eksbuild.3 \
  --service-account-role-arn arn:aws:iam::account-id:role/AmazonEKSVPCCNIRole \
  --resolve-conflicts PRESERVE --configuration-values '{"enableNetworkPolicy": "true"}'

If we configure the VPC-CNI add-on to inherit the node role AmazonEKSNodeRole:

We use this command to enable network policy:
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.14.0-eksbuild.3 \
  --resolve-conflicts PRESERVE --configuration-values '{"enableNetworkPolicy": "true"}'

Try it.

@kramuenke-catch Yes, the CRD is EKS-managed and is installed on new cluster creation (1.25+) and/or via the VPC CNI addon. If you install/upgrade via the provided manifest, Helm, or the managed addon, then the CRD should be installed. Do you have policyendpoints.networking.k8s.aws installed on your cluster? Also, as called out above, please check whether port 8080 on the host is used by any other process/container on your worker nodes; if yes, please change the port used by the node agent container as called out in the GitHub release note. Can you also please share your node log bundle with us at k8s-awscni-triage@amazon.com?

I quickly validated the upgrade flow (from 1.12.x/1.13.x to 1.14.0) on Bottlerocket nodes and I'm not able to reproduce the issue you're seeing. So node logs should help us identify whether it is indeed the port conflict causing the issue.

We also ran into this issue. It appears to be caused by a port conflict between the node-local-dns cache and the network policy agent. The node-local-dns cache binds to port 8080 on the hosts (https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml#L165), and the network policy agent also attempts to bind to that port:

{"level":"info","timestamp":"2023-09-05T17:39:15.026Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"error","timestamp":"2023-09-05T17:39:15.026Z","logger":"controller-runtime.metrics","msg":"metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts","error":"error listening on :8080: listen tcp :8080: bind: address already in use","stacktrace":"sigs.k8s.io/controller-runtime/pkg/metrics.NewListener\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/metrics/listener.go:48\nsigs.k8s.io/controller-runtime/pkg/manager.New\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.0/pkg/manager/manager.go:455\nmain.main\n\t/workspace/main.go:86\nruntime.main\n\t/root/sdk/go1.20.4/src/runtime/proc.go:250"}
{"level":"error","timestamp":"2023-09-05T17:39:15.026Z","logger":"setup","msg":"unable to create controller manager","error":"error listening on :8080: listen tcp :8080: bind: address already in use","stacktrace":"main.main\n\t/workspace/main.go:88\nruntime.main\n\t/root/sdk/go1.20.4/src/runtime/proc.go:250"}

To reproduce, set up node-local-dns on a cluster with VPC CNI 1.13, then attempt to upgrade to 1.14.

The node-local-dns cache is an essential component in production Kubernetes clusters. Is there a way to modify what port the network policy agent binds to in the CNI?

Thanks for providing this debugging information. Currently, the way to change the port that the node agent binds to for metrics is to pass the metrics-bind-addr command line argument to the node agent container.

We are working on making this configurable through the managed addon utility, and are deciding whether we should pass this flag with a different default than 8080

That option is not available to users of the EKS addon, from what I can see in the schema. We need a way to configure it there.

That option is not available to users of the EKS addon, from what I can see in the schema. We need a way to configure it there.

Right, as I mentioned, we are working on making this configurable through the managed addon utility. To work around this, you can directly modify the container in the daemonset, e.g. kubectl edit ds aws-node -n kube-system.
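For example, a rough sketch of that edit as a patch (the container index for aws-eks-nodeagent and the port :9985 are assumptions, so check your daemonset first; if the container has no args list yet, fall back to kubectl edit):

# confirm the position of the aws-eks-nodeagent container in the aws-node daemonset
kubectl get ds aws-node -n kube-system -o jsonpath='{.spec.template.spec.containers[*].name}'
# append the flag to that container's args (index 1 assumed here)
kubectl patch ds aws-node -n kube-system --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/1/args/-","value":"--metrics-bind-addr=:9985"}]'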

Can also confirm we conflicted with node-local-dns.

Same here; changing the node-local-dns cache health-check port from 8080 to another port solved the issue.
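For anyone else hitting this, a rough sketch of that change, assuming the upstream nodelocaldns manifest (ConfigMap and DaemonSet both named node-local-dns in kube-system); 8081 is just an example replacement port:

# find the health directive currently bound to 8080 in the Corefile
kubectl -n kube-system get configmap node-local-dns -o yaml | grep -n 'health'
# change that port (e.g. to 8081) in the Corefile...
kubectl -n kube-system edit configmap node-local-dns
# ...and update the matching livenessProbe port in the daemonset to the same value
kubectl -n kube-system edit daemonset node-local-dns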

Other than pinning explicit versions (for example, pinning v1.13.4-eksbuild.1 until this is fixed), is there a way to prevent breaking changes like this from automatically rolling out to clusters? I like the idea of most_recent = true (in terraform-aws-modules/eks/aws), but not if it's gonna break clusters.

Also, given that it's identified as a breaking change in the release notes, why was only the minor version bumped?

Is disabling the network policy engine (i.e. I presume setting VpcCni.enableNetworkPolicy to false in the addon config) sufficient to avoid this port conflict with node-local-dns? I'm happy with chaining the Cilium netpol engine for the time being.

@yurrriq AWS does not maintain Terraform, and while we strive to never introduce new versions that could lead to any breakage, it can happen. The issue here is a port conflict with other applications, and we cannot control what ports other applications listen on. To decrease the likelihood of conflicts, we are changing the default port that the node agent uses to one that is less likely to conflict. Similarly, one can change the port that these other applications, such as NodeLocal DNS, use.

The release note calls this out as a breaking change since there is a potential for port conflicts with other Kubernetes applications, but there is no breaking change from a Kubernetes/EKS API standpoint or from previous VPC CNI versions. That is why this is a minor change.

Setting VpcCni.enableNetworkPolicy=false does not prevent the node agent from starting and binding to that port for metrics; that flag is consumed by the controller. To modify the metrics port used by the node agent, you have to pass the metrics-bind-addr flag, as mentioned above. We plan to change the default and make it configurable through MAO (the managed addon) in the next week.

@yurrriq As called out above, the issue is due to a port conflict with another application on the cluster. While 8080 is the default, it is configurable via a flag, and you will be able to modify the port during the addon upgrade. No matter what port we pick as the default, it can potentially conflict with some application on the user's end. We do understand that 8080 is a popular default for quite a few applications out there, so we're moving to a less commonly used default port to (hopefully) avoid these conflicts.

Here, we solved it by passing metrics-bind-addr in the args:

example: --metrics-bind-addr=:9985

But it is painful, because we are applying the add-ons with the EKS Terraform module, which does not support passing args.

Please make the metrics port configurable in the same way as the enableNetworkPolicy flag.

Yes, this will be done as part of the 1.14.1 release.

Closing the issue as we made the metrics port configurable in release v1.14.1.
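For anyone arriving later, a rough sketch of doing this through the managed addon; the nodeAgent.metricsBindAddr key and the value format are assumptions, so verify them against the schema first:

# inspect the addon configuration schema for the node agent settings
aws eks describe-addon-configuration --addon-name vpc-cni --addon-version v1.14.1-eksbuild.1 \
  --query configurationSchema --output text
# then pass the port while updating the addon (key name assumed; confirm against the schema above)
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni --addon-version v1.14.1-eksbuild.1 \
  --resolve-conflicts PRESERVE --configuration-values '{"nodeAgent": {"metricsBindAddr": "9985"}}'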

โš ๏ธCOMMENT VISIBILITY WARNINGโš ๏ธ

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Use the same syntax as this one: --metrics-bind-addr=:9985
--metrics-bind-addr=9985 will not work, as it's not the correct format.

Thanks for the responses, @jdn5126 and @achevuru. Makes sense to me. And thanks for the v1.14.1 release. It's working without issue for us. (We too use node-local-dns with its hardcoded port 8080)

I can open a new issue or RTFM, but in case you see this and feel like answering: what's the function of the nodeagent sidecar if we're not using it for the netpol engine? Am I correct in thinking it's unnecessary bloat/complexity in that case?

Edit: I opened #2565