awslabs/amazon-eks-ami

Pods cannot access the Kubernetes service on AL2023 managed nodes with no remote SSH access

Closed this issue · 5 comments

What happened:

Sorry, this may sound like a bit of a weird one. While trying to switch to the latest AL2023_x86_64_STANDARD AMI, pods are unable to access the Kubernetes service (e.g. https://172.23.0.1:443/api/v1/namespaces/...) when no remote SSH access or source security groups are specified for the node group.

The cluster and the instances/ENIs all appear to have the correct eks-cluster-sg-my-cluster-1234679 security group, which allows full access to itself on all ports and protocols.
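For reference, I checked the group's ingress rules with something like:

# Expect a self-referencing rule allowing all protocols/ports from the same group.
aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=eks-cluster-sg-my-cluster-1234679" \
  --query 'SecurityGroups[].IpPermissions'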

I was able to disable remote access using AL2 AMIs, and also switch to AL2023 with remote access enabled, but this combination of AL2023 without remote access seems to be the only broken one. I'm not totally sure if this can be traced back to an AMI issue, but figured it's worth asking here.

For example, deploying Kyverno results in the following log:

E0327 22:57:09.268372       1 reflector.go:147] k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://172.23.0.1:443/api/v1/namespaces/kyverno/configmaps?fieldSelector=metadata.name%3Dkyverno-metrics&limit=500&resourceVersion=0": dial tcp 172.23.0.1:443: i/o timeout
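The same timeout is reproducible from a throwaway pod, e.g. (the pod name is arbitrary, the pod may need to land on an affected node, and 172.23.0.1 is the cluster's kubernetes Service ClusterIP from the error above):

# One-off pod that hits the in-cluster API endpoint; on an affected node this times out.
kubectl run net-debug --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -kv --max-time 10 https://172.23.0.1:443/version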

What you expected to happen:

Able to switch AMI types and disable remote SSH access without issue.

How to reproduce it (as minimally and precisely as possible):

Using the following managed node group in Terraform:

resource "aws_eks_node_group" "spot_t3_xlarge_blue" {
  cluster_name           = aws_eks_cluster.my_cluster.name
  node_group_name_prefix = "my-cluster-managed-spot-t3-xl-"
  node_role_arn          = data.aws_iam_role.my_node_role.arn
  subnet_ids             = slice(aws_subnet.private.*.id, 0, 3)

  # ami_type        = "AL2_x86_64"
  # release_version = "1.29.0-20240315"

  ami_type        = "AL2023_x86_64_STANDARD"
  release_version = "1.29.0-20240315"

  capacity_type = "SPOT"
  disk_size     = 50

  instance_types = ["t3a.xlarge"]

  # remote_access {
  #   ec2_ssh_key               = "my-ssh-key"
  #   source_security_group_ids = [aws_security_group.my_extra_sg.id]
  # }

  scaling_config {
    desired_size = 2
    max_size     = 5
    min_size     = 0
  }

  update_config {
    max_unavailable_percentage = 50
  }

  force_update_version = true

  lifecycle {
    ignore_changes        = [scaling_config[0].desired_size]
    create_before_destroy = true
  }
}

Anything else we need to know?:

It seems to be an issue for any pod that uses the Kubernetes API, such as Kyverno, CSI drivers, ingress controllers, etc.

Environment:

  • AWS Region: us-west-2
  • Instance Type(s): t3a.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.3"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.29"
  • AMI Version: 1.29.0-20240315
  • Kernel (e.g. uname -a): Linux ip-10-62-1-76.us-west-2.compute.internal 6.1.79-99.164.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Feb 27 18:02:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-04080b4b23b7f83cd"
BUILD_TIME="Fri Mar 15 04:44:46 UTC 2024"
BUILD_KERNEL="6.1.79-99.164.amzn2023.x86_64"
ARCH="x86_64"

@evandam
Hi, it's highly likely due to a known issue with our EKS AL2023 AMI, and could be addressed by #1738.

The behavior is that some AL2023 nodes work fine, while pods whose IPs are on secondary ENIs of other AL2023 nodes don't have connectivity.

@evandam Would you be able to check whether the affected pod IPs are attached to secondary ENIs on the node?
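One way to check, if it helps (assumes the AWS CLI is available, the pod IPs are plain secondary IPs rather than delegated prefixes, and the pod/namespace names are placeholders):

# Grab the pod IP, then find the ENI that carries it.
POD_IP=$(kubectl -n kyverno get pod <kyverno-pod> -o jsonpath='{.status.podIP}')

# DeviceIndex 0 is the node's primary ENI; anything higher is a secondary ENI.
aws ec2 describe-network-interfaces \
  --filters "Name=addresses.private-ip-address,Values=${POD_IP}" \
  --query 'NetworkInterfaces[].{eni:NetworkInterfaceId,deviceIndex:Attachment.DeviceIndex}'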

Hey @M00nF1sh and @achevuru thanks for the response! I can confirm these pod IPs are attached to secondary ENIs. Happy to test out the fix as soon as it's available if it helps!

@evandam this fix is in the latest AMI release: https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240329
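Rolling the managed node group onto that release should pick it up, for example (cluster/node group names below are placeholders, and the release_version string is assumed to follow the same 1.29.x-<date> pattern used above):

# Placeholder names; moves the node group to the AMI release containing the fix.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name <node-group-name> \
  --release-version 1.29.0-20240329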

Let us know if you're still running into this!

I probably shouldn't tack this onto this issue, and I'll open a new discussion and issue if anyone takes exception to it being here.


  • I saw a connectivity timeout to kubernetes.default.svc, first from a custom validating webhook under test, then from a test pod running curl -v 172.20.0.10:443.
  • I see that iptables-nft is used by default on AL2023, and I couldn't find any kube-proxy KUBE-SERVICES/KUBE-SVC/KUBE-SEP rules anywhere (a way to check both backends is sketched below).
  • I replaced it with iptables-legacy, the rules were created, and pod connectivity tests passed afterwards; I can't say whether that actually did anything or what really happened,
  • because a node-scaling component replaced the patched node with a new iptables-nft node, and the issue now seems sporadic or gone? I can connect right now.

Calling attention to it for now
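Roughly how I compared the two backends on the node (commands assume the stock AL2023 iptables package):

# Which backend does the iptables front-end report? (AL2023 shows nf_tables)
iptables -V

# Count kube-proxy's NAT rules under each backend.
sudo iptables-nft-save -t nat | grep -c 'KUBE-SERVICES'
sudo iptables-legacy-save -t nat | grep -c 'KUBE-SERVICES'

# Spot-check the service chain under the nft backend.
sudo iptables-nft -t nat -L KUBE-SERVICES -n | head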