AL2023 - Pods cannot access service endpoints with 1.30-v20240807 AMI
What happened: Connection to Kubernetes service endpoints from a pod fails with a connection timeout error.
kubectl get all -n test-cni-policy-namespace -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/network-policy-allowed 1/1 Running 0 27m 100.64.46.209 ip-10-128-2-130.eu-west-1.compute.internal <none> <none>
pod/network-policy-server 1/1 Running 0 27m 100.64.167.32 ip-10-128-11-207.eu-west-1.compute.internal <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/network-policy-server ClusterIP 172.20.23.207 <none> 80/TCP 27m run=network-policy-server
$ curl --fail --connect-timeout 5 -vf http://network-policy-server
* Rebuilt URL to: http://network-policy-server/
* Trying 172.20.23.207...
* TCP_NODELAY set
* Connection timed out after 5000 milliseconds
* stopped the pause stream!
* Closing connection 0
curl: (28) Connection timed out after 5000 milliseconds
What you expected to happen: The AL2 to AL2023 upgrade should not break pod-to-service connectivity.
How to reproduce it (as minimally and precisely as possible):
Set up an EKS cluster on version 1.30 with the AWS VPC CNI and Karpenter enabled.
Create two workloads within a namespace and try to access one workload from the other pod via a Kubernetes service (a minimal sketch follows these steps).
Note: As far as I observed, the issue occurs only when Karpenter schedules the pods onto a new node that has yet to be spun up.
The issue does not exist if Karpenter schedules the pods onto an existing AL2023 node.
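For reference, a minimal sketch of the repro, matching the names in the output above (the image choices are assumptions; any server and curl-capable client will do):

# Server pod with a matching label (run=network-policy-server), exposed as a ClusterIP service
kubectl create namespace test-cni-policy-namespace
kubectl -n test-cni-policy-namespace run network-policy-server \
  --image=nginx --labels=run=network-policy-server --port=80
kubectl -n test-cni-policy-namespace expose pod network-policy-server --port=80
# Client pod that curls the service by name
kubectl -n test-cni-policy-namespace run network-policy-allowed \
  --image=curlimages/curl --restart=Never -it -- \
  curl --fail --connect-timeout 5 -v http://network-policy-server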
Anything else we need to know?:
Environment: EKS cluster 1.30, CNI Plugin version: v1.18.3, karpenter version: controller:0.35.0
- AWS Region: eu-west-1
- Instance Type(s): c7g.medium
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.6
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.30"
- AMI Version: amazon-eks-node-al2023-arm64-standard-1.30-v20240807
- Kernel (e.g. uname -a): Linux ip-10-128-11-207.eu-west-1.compute.internal 6.1.102-108.177.amzn2023.aarch64 SMP Wed Jul 31 10:18:24 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
sh-5.2$ cat /etc/eks/release
BASE_AMI_ID="ami-07852ff870d90548b"
BUILD_TIME="Wed Aug 7 20:47:10 UTC 2024"
BUILD_KERNEL="6.1.102-108.177.amzn2023.aarch64"
ARCH="aarch64"
sh-5.2$
Do you have this problem without any NetworkPolicy?
I have a network policy in place which allows all ingress and egress communication in the namespace.
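The exact manifest isn't included here, but an allow-all policy of that shape typically looks like this (a sketch; the policy name is an assumption):

# Allow all ingress and egress for every pod in the namespace
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
  namespace: test-cni-policy-namespace
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - {}
  egress:
    - {}
EOF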
Can you check if removing the policy fixes the problem? This may be an issue with the network policy agent.
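One way to run that check, assuming the policies live in the same namespace as the pods:

# Temporarily remove the namespace's policies, then retry the in-pod curl
kubectl -n test-cni-policy-namespace get networkpolicy
kubectl -n test-cni-policy-namespace delete networkpolicy --all
kubectl -n test-cni-policy-namespace exec network-policy-allowed -- \
  curl --fail --connect-timeout 5 -v http://network-policy-server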
You are right, it seems the CNI is the culprit.
We have "NETWORK_POLICY_ENFORCING_MODE": "strict" set in the VPC CNI, which requires a network policy for inter-pod communication; without this setting the communication works fine.
I can confirm we have the same config with AL2 nodes and it works well.
Anyway, I have an AWS support case in progress to troubleshoot the same.
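For reference, this is roughly how we verify the setting on a cluster (assuming the env var and the network policy agent container sit on the aws-node daemonset, as in upstream amazon-vpc-cni-k8s; the pod name is a placeholder):

# Confirm the enforcing mode configured on the VPC CNI daemonset
kubectl -n kube-system describe daemonset aws-node | grep NETWORK_POLICY_ENFORCING_MODE
# Inspect the network policy agent logs on the aws-node pod of the affected node
kubectl -n kube-system logs <aws-node-pod-on-affected-node> -c aws-network-policy-agent --tail=100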
You may want to cut an issue in https://github.com/aws/aws-network-policy-agent as well; but this doesn't sound like an AMI issue as of now. I'll track the internal case as well 👍