awslabs/amazon-eks-ami

AL2023 - Pods cannot access service endpoints with 1.30-v20240807 AMI

Closed · 5 comments

What happened: Connections to Kubernetes service endpoints from a pod fail with a connection timeout error.

 kubectl get all -n test-cni-policy-namespace -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
pod/network-policy-allowed   1/1     Running   0          27m   100.64.46.209   ip-10-128-2-130.eu-west-1.compute.internal    <none>           <none>
pod/network-policy-server    1/1     Running   0          27m   100.64.167.32   ip-10-128-11-207.eu-west-1.compute.internal   <none>           <none>

NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE   SELECTOR
service/network-policy-server   ClusterIP   172.20.23.207   <none>        80/TCP    27m   run=network-policy-server

$ curl --fail --connect-timeout 5 -vf http://network-policy-server
* Rebuilt URL to: http://network-policy-server/
*   Trying 172.20.23.207...
* TCP_NODELAY set
* Connection timed out after 5000 milliseconds
* stopped the pause stream!
* Closing connection 0
curl: (28) Connection timed out after 5000 milliseconds

What you expected to happen: The AL2 to AL2023 upgrade should not break pod-to-service connectivity.

How to reproduce it (as minimally and precisely as possible):
Set up an EKS cluster on version 1.30 with the AWS VPC CNI and Karpenter enabled.
Create two workloads in a namespace and try to access one workload from the other pod via a Kubernetes service (a rough sketch follows below).
Note: From what I observed, the issue occurs only when Karpenter schedules the pods onto a new node that has to be spun up.
The issue does not occur if Karpenter schedules the pods onto an existing AL2023 node.
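
A rough sketch of the test setup, assuming nginx and curl images for the server and client (the namespace and workload names are taken from the outputs above; the actual manifests may differ):

kubectl create namespace test-cni-policy-namespace
# Server pod and ClusterIP service (kubectl run adds the label run=network-policy-server, which matches the service selector shown above)
kubectl -n test-cni-policy-namespace run network-policy-server --image=nginx --port=80
kubectl -n test-cni-policy-namespace expose pod network-policy-server --port=80
# Client pod; for the failing case, both pods must land on a node freshly provisioned by Karpenter
kubectl -n test-cni-policy-namespace run network-policy-allowed --image=curlimages/curl --command -- sleep infinity
# Reproduce the timeout
kubectl -n test-cni-policy-namespace exec network-policy-allowed -- curl --fail --connect-timeout 5 -v http://network-policy-server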
Anything else we need to know?:

Environment: EKS cluster 1.30, CNI plugin version: v1.18.3, Karpenter version: controller 0.35.0

  • AWS Region: eu-west-1
  • Instance Type(s): c7g.medium
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.6
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.30"
  • AMI Version: amazon-eks-node-al2023-arm64-standard-1.30-v20240807
  • Kernel (e.g. uname -a): Linux ip-10-128-11-207.eu-west-1.compute.internal 6.1.102-108.177.amzn2023.aarch64 SMP Wed Jul 31 10:18:24 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
sh-5.2$ cat /etc/eks/release
BASE_AMI_ID="ami-07852ff870d90548b"
BUILD_TIME="Wed Aug  7 20:47:10 UTC 2024"
BUILD_KERNEL="6.1.102-108.177.amzn2023.aarch64"
ARCH="aarch64"
sh-5.2$

Do you have this problem without any NetworkPolicy?

I have a network policy in place which allows all ingress and egress communication within the namespace.
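
For reference, a minimal sketch of such an allow-all policy (the policy name is hypothetical; the namespace and selector are assumptions based on the outputs above):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all   # hypothetical name
  namespace: test-cni-policy-namespace
spec:
  podSelector: {}   # select every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - {}            # allow all ingress
  egress:
    - {}            # allow all egress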

Can you check if removing the policy fixes the problem? This may be an issue with the network policy agent.
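
For example, something along these lines (the policy name below is hypothetical, use whatever the list shows):

kubectl -n test-cni-policy-namespace get networkpolicy
kubectl -n test-cni-policy-namespace delete networkpolicy allow-all
# then retry the curl from the client pod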

You are right, it seems the CNI is the culprit.
We have "NETWORK_POLICY_ENFORCING_MODE":"strict" set in the VPC CNI, which requires a network policy for inter-pod communication; without this setting the communication works fine.
I can confirm we have the same configuration with AL2 nodes and it works well there.
In any case, I have an AWS support case in progress to troubleshoot this.
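
For reference, this is roughly how we checked where that setting lands on the node agent (exact flag/env names may vary by VPC CNI and network policy agent version, so this is just a sketch):

# Look for the enforcing-mode setting on the aws-node DaemonSet
kubectl -n kube-system get daemonset aws-node -o yaml | grep -i -B2 -A2 enforcing
# Check the aws-node pod running on the affected node
kubectl -n kube-system get pods -o wide | grep ip-10-128-11-207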

You may want to cut an issue in https://github.com/aws/aws-network-policy-agent as well; but this doesn't sound like an AMI issue as of now. I'll track the internal case as well 👍