Pods cannot access the Kubernetes service on AL2023 managed nodes with no remote SSH access
What happened:
Sorry, this may sound like a bit of a weird one. While trying to switch to the latest AL2023_x86_64_STANDARD
AMI, pods are unable to access the Kubernetes service (e.g. https://172.23.0.1:443/api/v1/namespaces/...)
when there is no remote SSH access or security groups specified for the node group.
It seems like the cluster and the instances/ENIs all have the correct eks-cluster-sg-my-cluster-1234679
security group, which allows all ports/protocols to itself.
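For reference, a quick way to sanity-check that self-referencing rule from the AWS CLI (the cluster name below is a placeholder):

# Look up the EKS-managed cluster security group and dump its ingress rules
CLUSTER_SG=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)
aws ec2 describe-security-groups --group-ids "$CLUSTER_SG" \
  --query 'SecurityGroups[0].IpPermissions'

The output should show a single all-protocols/all-ports rule whose UserIdGroupPairs reference the same group ID.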
I was able to disable remote access using AL2 AMIs, and also switch to AL2023 with remote access enabled, but this combination of AL2023 without remote access seems to be the only broken one. I'm not totally sure if this can be traced back to an AMI issue, but figured it's worth asking here.
For example, deploying Kyverno results in the following log:
E0327 22:57:09.268372 1 reflector.go:147] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.23.0.1:443/api/v1/namespaces/kyverno/configmaps?fieldSelector=metadata.name%3Dkyverno-metrics&limit=500&resourceVersion=0": dial tcp 172.23.0.1:443: i/o timeout
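To rule out Kyverno itself, the same timeout can usually be reproduced from a throwaway debug pod (curlimages/curl is just an example image that ships curl; 172.23.0.1 is the kubernetes service IP from the log above):

# Expect a TLS handshake and a 401/403 from the API server when connectivity works,
# and a hang followed by a timeout when it does not
kubectl run net-debug --rm -it --restart=Never --image=curlimages/curl -- \
  curl -vk --max-time 10 https://172.23.0.1:443/version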
What you expected to happen:
Able to switch AMI types and disable remote SSH access without issue.
How to reproduce it (as minimally and precisely as possible):
Using the following managed node group in Terraform:
resource "aws_eks_node_group" "spot_t3_xlarge_blue" {
cluster_name = aws_eks_cluster.my_cluster.name
node_group_name_prefix = "my-cluster-managed-spot-t3-xl-"
node_role_arn = data.aws_iam_role.my_node_role.arn
subnet_ids = slice(aws_subnet.private.*.id, 0, 3)
# ami_type = "AL2_x86_64"
# release_version = "1.29.0-20240315"
ami_type = "AL2023_x86_64_STANDARD"
release_version = "1.29.0-20240315"
capacity_type = "SPOT"
disk_size = "50"
instance_types = ["t3a.xlarge"]
# remote_access {
# ec2_ssh_key = ["my-ssh-key"
# source_security_group_ids = [aws_security_group.my_extra_sg.id]
# }
scaling_config {
desired_size = 2
max_size = 5
min_size = 0
}
update_config {
max_unavailable_percentage = 50
}
force_update_version = true
lifecycle {
ignore_changes = [scaling_config[0].desired_size]
create_before_destroy = true
}
}
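To confirm how the node group actually came out, the remote access settings (and the extra remote-access security group EKS creates when SSH is enabled) can be read back with the AWS CLI; the node group name below is a placeholder, since node_group_name_prefix generates it:

aws eks describe-nodegroup --cluster-name my-cluster \
  --nodegroup-name <node-group-name> \
  --query '{remoteAccess: nodegroup.remoteAccess, remoteAccessSG: nodegroup.resources.remoteAccessSecurityGroup}'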
Anything else we need to know?:
It seems to be an issue for any pod that uses the Kubernetes API, such as Kyverno, CSI drivers, ingress controllers, etc.
Environment:
- AWS Region: us-west-2
- Instance Type(s): t3a.xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.3"
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.29"
- AMI Version: 1.29.0-20240315
- Kernel (e.g. uname -a): Linux ip-10-62-1-76.us-west-2.compute.internal 6.1.79-99.164.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Feb 27 18:02:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
  BASE_AMI_ID="ami-04080b4b23b7f83cd"
  BUILD_TIME="Fri Mar 15 04:44:46 UTC 2024"
  BUILD_KERNEL="6.1.79-99.164.amzn2023.x86_64"
  ARCH="x86_64"
@evandam Would you be able to check if the affected pod IPs are attached to Secondary ENIs on the node?
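One way to check that (pod name/namespace are placeholders; the filter also matches secondary private IPs on an ENI):

POD_IP=$(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.podIP}')
aws ec2 describe-network-interfaces \
  --filters "Name=addresses.private-ip-address,Values=$POD_IP" \
  --query 'NetworkInterfaces[].{eni:NetworkInterfaceId,deviceIndex:Attachment.DeviceIndex,securityGroups:Groups[].GroupId}'

A device index greater than 0 means the pod IP lives on a secondary ENI, and the Groups list shows which security groups that ENI carries.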
@evandam this fix is in the latest AMI release: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240329
Let us know if you're still running into this!
I probably should not plug this here, and will open a new discussion and an issue if anyone takes exception to it being in this thread.
- I saw a connectivity timeout to kubernetes.default.svc, first from a custom validating webhook being tested, then from a test pod running curl -v 172.20.0.10:443.
- I see iptables-nft is used by default on AL2023, and I couldn't find any kube-proxy service/endpoint (KUBE-SVC/KUBE-SEP) rules anywhere.
- I replaced it with iptables-legacy, the rules were created, and tests from pods connected afterwards. I can't say whether that is what actually fixed anything, because a node-scaler component replaced the patched node with a new node using iptables-nft, and the issue now seems sporadic or gone; I can connect right now.
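For anyone else hitting this, a quick way to see which iptables backend a node is using and whether kube-proxy's NAT rules exist under each backend (run on the node; the iptables-legacy tooling may need to be installed separately on AL2023):

# Shows the active backend, e.g. "iptables v1.8.x (nf_tables)" or "(legacy)"
iptables -V

# Count kube-proxy service/endpoint chains in the NAT table for each backend
iptables-nft-save -t nat 2>/dev/null | grep -c 'KUBE-SVC\|KUBE-SEP'
iptables-legacy-save -t nat 2>/dev/null | grep -c 'KUBE-SVC\|KUBE-SEP'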