aws/amazon-vpc-cni-k8s

Pods stuck in `CrashLoopBackOff` when restarting a custom EKS node.

ddl-pjohnson opened this issue · 7 comments

What happened:

We have a custom AMI that we deploy to EC2 and connect to an existing EKS cluster. We start and stop this node as needed to save costs. In addition, the instance has state that we want to maintain across restarts, i.e. we don't want to get a new node on every restart.

Over the past 2 months we've noticed an issue where k8s doesn't restart properly: some pods get stuck in a CrashLoopBackOff when they try to connect to other pods or services. DNS resolves to the correct IP address, but packets aren't routed to the other pod correctly. It looks like a race condition where pods start before the network is set up correctly.

The only reliable fix we've found is to delete all pods and let k8s recreate them; this seems to set up the correct iptables rules.
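For reference, a per-node pod delete can be scripted roughly like this; the node-name value and the loop are illustrative, not the exact command we use:

```bash
# Delete every pod scheduled on the affected node so k8s recreates them.
# NODE_NAME is a placeholder; substitute the real node name.
NODE_NAME="ip-10-0-0-1.ec2.internal"

kubectl get pods --all-namespaces \
  --field-selector spec.nodeName="$NODE_NAME" \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers \
| while read -r ns name; do
    kubectl delete pod -n "$ns" "$name"
  done
```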

Is there a better way to fix this? It kind of looks like projectcalico/calico#5135, but not sure if the problem is in Calico or AWS.

Environment:

  • Kubernetes version:
    v1.27.10-eks-508b6b3

  • CNI Version:
    amazon-k8s-cni:v1.15.1-eksbuild.1

  • OS (e.g: cat /etc/os-release):

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
  • Kernel (e.g. uname -a):
Linux pauljo41250 5.10.197-186.748.amzn2.x86_64 #1 SMP Tue Oct 10 00:30:07 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

> Is there a better way to fix this? It kind of looks like projectcalico/calico#5135, but not sure if the problem is in Calico or AWS.

Do you have both Calico and the VPC CNI installed? Do you know where the specific error message is coming from?
Could you share the status of your aws-node pods and the other pods in the kube-system namespace? (Commands for pulling this are sketched below the questions.)

  1. What does the pod log of the CrashLoopBackOff containers say?
  2. What does the IPAMD log say?
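A minimal sketch of the commands for pulling that information, assuming the default EKS labels and the default VPC CNI log location:

```bash
# Status of aws-node, kube-proxy, and the rest of kube-system
kubectl get pods -n kube-system -o wide

# Logs from the aws-node DaemonSet pods
kubectl logs -n kube-system -l k8s-app=aws-node --all-containers --tail=200

# IPAMD log (run on the affected node itself)
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log
```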

Yep, we have both installed.

The errors are in the pod logs, and it's somewhat random which pods have them. Usually they are connection refused errors when connecting to the Kube API or other pods, e.g.:
Invalid Kubernetes API v1 endpoint https://172.20.0.1:443/api: Timed out connecting to server

Or connecting to another pod:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='nucleus-frontend', port=80)

The exact errors vary by things like the language used and what they're connecting to. In all cases DNS works correctly, but the packets aren't routed to the other pod/service.
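For context, the split can be demonstrated from inside an affected pod roughly like this (the pod name is a placeholder, and nslookup/wget depend on what the container image ships):

```bash
# DNS resolves the service name...
kubectl exec -it <affected-pod> -- nslookup nucleus-frontend

# ...but a TCP connection to the resolved ClusterIP never completes
kubectl exec -it <affected-pod> -- wget -qO- -T 5 http://nucleus-frontend:80
```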

Is there a secure way to send you logs and pod statuses?

> Invalid Kubernetes API v1 endpoint https://172.20.0.1:443/api: Timed out connecting to server

This is a strange error message.
Can you confirm that the API server endpoint matches?

kubectl get endpoints kubernetes -o jsonpath='{.subsets[].addresses[].ip}'

I would expect the API path to be /api/v1 in the error message. I am not sure why it tried to connect to /api.

You can follow the troubleshooting doc at https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md and send the logs to k8s-awscni-triage@amazon.com for us to investigate.
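For the node-level bundle, the doc points at the CNI support script; as a sketch, assuming the script is at its usual location (a custom AMI may place it elsewhere or not include it):

```bash
# Run on the affected node; writes a support tarball under /var/log
# that can be attached to the email above.
sudo bash /opt/cni/bin/aws-cni-support.sh
```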

I suspect that kube-proxy wasn't running when this error occurred, but the error message itself isn't typical either.
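A quick way to check that theory, assuming the default k8s-app=kube-proxy label and kube-proxy running in iptables mode:

```bash
# Is kube-proxy up on the node that was just started?
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# On the node: have the service DNAT rules been programmed yet?
sudo iptables -t nat -L KUBE-SERVICES -n | head
```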

It's not just the Kubernetes API; it's basically random which services and pods can be reached and which can't. E.g. one pod won't be able to connect to our RabbitMQ service, while another one will be able to connect to RabbitMQ but won't connect to Vault, etc.

We've fixed this by draining/cordoning the node on startup. I'll try tracking down the bundle of logs and sending them through.
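For anyone else hitting this, the startup hook amounts to something like the sketch below; the node-name lookup, labels, and timeouts are illustrative rather than our exact script:

```bash
#!/usr/bin/env bash
# Boot-time hook (e.g. a systemd oneshot unit): drain the node so stale
# pods are evicted, wait for the CNI and kube-proxy pods on this node to
# be Ready, then uncordon so workloads get rescheduled.
set -euo pipefail

NODE_NAME="$(hostname -f)"   # assumes the node name matches the FQDN

kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data --timeout=300s

for app in aws-node kube-proxy; do
  pod="$(kubectl get pods -n kube-system -l k8s-app="$app" \
         --field-selector spec.nodeName="$NODE_NAME" \
         -o jsonpath='{.items[0].metadata.name}')"
  kubectl wait --for=condition=Ready "pod/$pod" -n kube-system --timeout=300s
done

kubectl uncordon "$NODE_NAME"
```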

> We've fixed this by draining/cordoning the node on startup.

Was this node-specific behavior? If yes, perhaps there is something running on the node that is changing iptables. Yes, logs will help.

Closing this, as the customer was able to resolve this at the node level.

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.