awslabs/amazon-eks-ami

AL2023 - Pods cannot access service endpoints with 1.30-v20240807 AMI

Closed · 5 comments

What happened: Connections to Kubernetes service endpoints from a pod fail with a connection timeout error.

 kubectl get all -n test-cni-policy-namespace -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
pod/network-policy-allowed   1/1     Running   0          27m   100.64.46.209   ip-10-128-2-130.eu-west-1.compute.internal    <none>           <none>
pod/network-policy-server    1/1     Running   0          27m   100.64.167.32   ip-10-128-11-207.eu-west-1.compute.internal   <none>           <none>

NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE   SELECTOR
service/network-policy-server   ClusterIP   172.20.23.207   <none>        80/TCP    27m   run=network-policy-server

$ curl --fail --connect-timeout 5 -vf http://network-policy-server
* Rebuilt URL to: http://network-policy-server/
*   Trying 172.20.23.207...
* TCP_NODELAY set
* Connection timed out after 5000 milliseconds
* stopped the pause stream!
* Closing connection 0
curl: (28) Connection timed out after 5000 milliseconds

What you expected to happen: The AL2 to AL2023 upgrade should not break pod-to-service connectivity.

How to reproduce it (as minimally and precisely as possible):
Set up an EKS cluster on version 1.30 with the AWS VPC CNI and Karpenter enabled.
Create two workloads in a namespace and try to access one workload from the other pod via a Kubernetes service (a rough sketch follows below).
Note: From what I observed, the issue occurs only when Karpenter schedules the pods onto a new node that has to be spun up.
The issue does not occur if Karpenter schedules the pods onto an existing AL2023 node.
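
A rough sketch of the test setup, assuming nginx and curl images for the server and client (the namespace and workload names are taken from the outputs above; the actual manifests may differ):

kubectl create namespace test-cni-policy-namespace
# Server pod and ClusterIP service (kubectl run adds the label run=network-policy-server, which matches the service selector shown above)
kubectl -n test-cni-policy-namespace run network-policy-server --image=nginx --port=80
kubectl -n test-cni-policy-namespace expose pod network-policy-server --port=80
# Client pod; for the failing case, both pods must land on a node freshly provisioned by Karpenter
kubectl -n test-cni-policy-namespace run network-policy-allowed --image=curlimages/curl --command -- sleep infinity
# Reproduce the timeout
kubectl -n test-cni-policy-namespace exec network-policy-allowed -- curl --fail --connect-timeout 5 -v http://network-policy-server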
Anything else we need to know?:

Environment: EKS cluster 1.30, CNI plugin version: v1.18.3, Karpenter version: controller 0.35.0

  • AWS Region: eu-west-1
  • Instance Type(s): c7g.medium
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.6
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.30"
  • AMI Version: amazon-eks-node-al2023-arm64-standard-1.30-v20240807
  • Kernel (e.g. uname -a): Linux ip-10-128-11-207.eu-west-1.compute.internal 6.1.102-108.177.amzn2023.aarch64 SMP Wed Jul 31 10:18:24 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
sh-5.2$ cat /etc/eks/release
BASE_AMI_ID="ami-07852ff870d90548b"
BUILD_TIME="Wed Aug  7 20:47:10 UTC 2024"
BUILD_KERNEL="6.1.102-108.177.amzn2023.aarch64"
ARCH="aarch64"
sh-5.2$

Do you have this problem without any NetworkPolicy?

I have a network policy in place which allows all ingress and egress communication within the namespace.
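
For reference, a minimal sketch of such an allow-all policy (the policy name is hypothetical; the namespace and selector are assumptions based on the outputs above):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all   # hypothetical name
  namespace: test-cni-policy-namespace
spec:
  podSelector: {}   # select every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - {}            # allow all ingress
  egress:
    - {}            # allow all egress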

Can you check if removing the policy fixes the problem? This may be an issue with the network policy agent.
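
For example, something along these lines (the policy name below is hypothetical, use whatever the list shows):

kubectl -n test-cni-policy-namespace get networkpolicy
kubectl -n test-cni-policy-namespace delete networkpolicy allow-all
# then retry the curl from the client pod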

You are right, it seems the CNI is the culprit.
We have "NETWORK_POLICY_ENFORCING_MODE":"strict" set in the VPC CNI, which requires a network policy for inter-pod communication; without this setting the communication works fine.
I can confirm we have the same configuration with AL2 nodes and it works well there.
In any case, I have an AWS support case in progress to troubleshoot this.
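
For reference, this is roughly how we checked where that setting lands on the node agent (exact flag/env names may vary by VPC CNI and network policy agent version, so this is just a sketch):

# Look for the enforcing-mode setting on the aws-node DaemonSet
kubectl -n kube-system get daemonset aws-node -o yaml | grep -i -B2 -A2 enforcing
# Check the aws-node pod running on the affected node
kubectl -n kube-system get pods -o wide | grep ip-10-128-11-207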

You may want to cut an issue in https://github.com/aws/aws-network-policy-agent as well; but this doesn't sound like an AMI issue as of now. I'll track the internal case as well 👍