Issue with AMI Release v20240817, AWS CNI not working
jma-amboss opened this issue · 6 comments
What happened:
Upgrading the EKS AMI from AMI Release v20240807 to AMI Release v20240817 breaks the AWS CNI. Reverting to the previous AMI resolves the issue.
CNI version: v1.18.3
What you expected to happen:
An AMI minor version upgrade should not break anything.
How to reproduce it (as minimally and precisely as possible):
Upgrade EKS worker nodes to AMI Release v20240817
Anything else we need to know?:
Pod IP addresses are not being assigned correctly, and there are errors in the kube-proxy logs:
E0826 15:52:11.209793 1 reflector.go:147] vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.EndpointSlice: failed to list *v1.EndpointSlice: Get "https://<redacted>.<redacted>.us-east-1.eks.amazonaws.com/apis/discovery.k8s.io/v1/endpointslices?labelSelector=%21service.kubernetes.io%2Fheadless%2C%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0": dial tcp: lookup <redacted>.<redacted>.us-east-1.eks.amazonaws.com on [::1]:53: dial udp [::1]:53: connect: cannot assign requested address
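The lookup hitting [::1]:53 suggests kube-proxy is falling back to the localhost resolver rather than the VPC resolver. A rough way to check this from the node (the <redacted> endpoint is a placeholder for our cluster URL):
# Inspect the resolver configuration the node (and kube-proxy) is using
cat /etc/resolv.conf
# Confirm the cluster endpoint resolves through the configured nameserver
getent hosts <redacted>.<redacted>.us-east-1.eks.amazonaws.com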
There are also error messages in the node-level logs:
{"level":"error","ts":"2024-08-26T14:34:42.760Z","logger":"controller-runtime.source.EventHandler","caller":"source/kind.go:68","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"[https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1](https://172.20.0.1/apis/networking.k8s.aws/v1alpha1)\": dial tcp 172.20.0.1:443: connect: connection timed out"} {"level":"error","ts":"2024-08-26T14:34:52.745Z","logger":"controller-runtime.source.EventHandler","caller":"source/kind.go:68","msg":"failed to get informer from cache","error":"failed to get API group resources: unable to retrieve the complete list of server APIs: networking.k8s.aws/v1alpha1: Get \"[https://172.20.0.1:443/apis/networking.k8s.aws/v1alpha1](https://172.20.0.1/apis/networking.k8s.aws/v1alpha1)\": dial tcp 172.20.0.1:443: connect: connection timed out"}
We also noted that there is no connectivity from the EC2 instance to the resolved IP address of the EKS cluster endpoint.
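A minimal check for this from the node looks something like the following (assuming curl is present on the AMI; any HTTP status code back proves TCP/TLS reachability, while a timeout matches the symptom above):
# Check the cluster endpoint on 443
curl -sk -m 10 -o /dev/null -w '%{http_code}\n' https://<redacted>.<redacted>.us-east-1.eks.amazonaws.com/healthz
# Check the in-cluster service IP referenced in the node-level logs
curl -sk -m 10 -o /dev/null -w '%{http_code}\n' https://172.20.0.1:443/healthz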
Environment:
- AWS Region: us-east-1
- CNI version: v1.18.3
- Instance Type(s): t3.2xlarge, t3.large
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.16
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.28
- AMI Version: ami-0d364e0801521b622
- Kernel (e.g. uname -a): unknown, already reverted the AMI
- Release information (run cat /etc/eks/release on a node): unknown, already reverted the AMI
@jma-amboss can you open a case with AWS support so we can get some more information? We're not able to reproduce anything like this.
@ing-ash, @shubha-shyam, @asri-badlah if any of you have seen this issue and can open an AWS support case, it'd be really helpful. We need logs from an instance having this problem to determine a root cause.
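If anyone can reproduce this, the log collector script in this repo captures the node-level logs we need (path assumed from the current repo layout; a tarball should land under /var/log to attach to the case):
# Run on an affected node
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/main/log-collector-script/linux/eks-log-collector.sh
sudo bash eks-log-collector.sh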
@cartermckinnon Unfortunately, we don't have an AWS support plan. We can only contact our reseller, which provides us with basic support.
Understood; do you see any evidence in the containerd logs that this could be related to #1933?
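Something like the following on an affected node would show the containerd and kubelet activity from the failure window (timestamps are just an example):
# containerd service logs around the time of the failures
journalctl -u containerd --since "2024-08-26 14:00" --until "2024-08-26 16:00" --no-pager
# kubelet logs from the same window usually show the corresponding sandbox/CNI errors
journalctl -u kubelet --since "2024-08-26 14:00" --until "2024-08-26 16:00" --no-pager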
I don't know if my issue is exactly the same, but I was just moving from AL2 to AL2023 and discovered something similar: CoreDNS issues (it can't talk to Kubernetes), although if I kill the pod multiple times in a row it eventually stops complaining.
I saw the same thing with other pods that talk to the Kubernetes API: if I restarted them enough times they would become healthy, but things would still be glitchy and I'd have ongoing DNS issues. Reverting to AL2 immediately fixed things.
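For reference, the pod restarts I mentioned were roughly the following (the label selector assumes the standard kube-system CoreDNS install on EKS):
# Delete the CoreDNS pods so they get recreated; I had to repeat this several times before they stayed healthy
kubectl -n kube-system delete pod -l k8s-app=kube-dns
# Watch the replacements come up
kubectl -n kube-system get pods -l k8s-app=kube-dns -w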
I made a case (172513694900840), so if there's anything I can gather for you, I'm happy to help troubleshoot.
I dug around in the logs on the nodes themselves (just watching with journalctl -ef) but didn't see anything that leapt out at me as obviously broken.
I have a cluster I can switch back over to these nodes for testing purposes if you need me to run anything specific (I rolled everything back for now and did not try to downgrade the CNI). This was EKS 1.29, for reference.
@apenney That sounds like a different issue; the original report was on AL2.
I don't see a smoking gun here. Timeouts to the API server can happen for many reasons, and without more information we can't really narrow down the cause. I'm not aware of any issues in us-east-1 at the time this was occurring. If you can provide more information, please @ mention me.