aws-node daemonset liveness/readiness probes keep failing and daemonset fails to restart on EKS 1.27
edison-vflow opened this issue · 16 comments
What happened:
aws-node daemonset liveness/readiness probes keep failing and the daemonset fails to restart.
When this happens, the node keeps receiving requests to schedule pods.
Any pods that are scheduled on that node remain in container creating state.
Attempts to restart the aws-node daemonset also fail.
What makes the failure of the aws-node daemonset critical to us is that we have about 30 microservices. If just one of them is scheduled on a node that has the issue, the entire application fails to start.
We heavily depend on spinning up CI/CD environments that run tests on our code before deployment.
All our CI/CD systems are currently broken due to these failures, compromising what we deploy to production.
We have looked at https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and none of them seem to be our issue.
We also upgraded the VPC CNI addon to the latest version on our clusters, which are EKS 1.27.
Attach logs
- Logs have been collected and sent to k8s-awscni-triage@amazon.com with subject: Urgent :: Github issue https://github.com/aws/amazon-vpc-cni-k8s/issues/2743
What you expected to happen:
- Pods should always be able to create and terminate on any node they are scheduled on. However, the aws-node daemonset has in the past week or so started to fail frequently, and when it fails liveness/readiness probes, it fails to restart, rendering the node unusable. When a node has the aws-node daemonset failing, the scheduler nevertheless keeps trying to schedule workloads on it, and the containers remain stuck in ContainerCreating state. If you try to terminate pods that are already running on that node, they fail to terminate and remain in Terminating state.
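As an aside for anyone triaging this, affected nodes can be spotted by listing pods stuck in ContainerCreating. A minimal sketch (the awk column positions assume the default `kubectl get pods -A -o wide` layout; the helper name is ours, and it is demonstrated here on captured sample output so it runs standalone):

```shell
# Hypothetical helper: filter `kubectl get pods -A -o wide` output for pods
# stuck in ContainerCreating and report which node they are pinned to.
stuck_pods() {
    awk '$4 == "ContainerCreating" { print $2 " on " $8 }'
}

# Demonstrated on captured sample output; live usage would be:
#   kubectl get pods -A -o wide | stuck_pods
sample='NAMESPACE   NAME       READY   STATUS              RESTARTS   AGE   IP         NODE
app         web-6f9c   0/1     ContainerCreating   0          10m   <none>     ip-10-0-1-23
app         api-5d2b   1/1     Running             0          30m   10.0.1.9   ip-10-0-1-24'
printf '%s\n' "$sample" | stuck_pods    # -> web-6f9c on ip-10-0-1-23
```

Nodes that repeatedly show up in this listing are candidates for cordoning while the root cause is investigated.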
How to reproduce it (as minimally and precisely as possible):
- The issue happens intermittently and often, but we don't have control over which node in our EKS fleet it happens on or how frequently. The longest our fleet of nodes in a cluster goes without any node's aws-node daemonset going down is about 15 minutes.
Anything else we need to know?:
- When a node's aws-node daemonset is down, running the prescribed log collection script, `sudo bash eks-log-collector.sh`, actually hangs. On some nodes it runs up to:
Trying to collect common operating system logs... Trying to collect kernel logs... Trying to collect modinfo... Trying to collect mount points and volume information...
and on some up to:
Trying to collect common operating system logs... Trying to collect kernel logs...
Environment:
- Kubernetes version (use `kubectl version`): 1.27
- CNI Version: v1.16.0-eksbuild.1
- OS (e.g: `cat /etc/os-release`): Amazon Linux :: AMI name => amazon-eks-node-1.27-v20231002
- Kernel (e.g. `uname -a`):
Looking at logs
Debugged via k8s Slack and concluded that the VPC CNI is not at issue here. The aws-node pod, along with other pods, is failing the readiness probe check due to what appears to be I/O starvation. Other already running pods on these affected nodes experience the same issue.
There is a lot of CPU throttling, I/O throttling, and OOMs happening on this node, and these events are the likely causes of the containerd/kubelet errors.
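The throttling described here can be confirmed directly on an affected node by reading the cgroup CPU statistics. A minimal sketch, assuming cgroup v1 paths as used on these Amazon Linux 2 AMIs (on cgroup v2 the counters live in a unified `cpu.stat` with `throttled_usec`); the parsing is demonstrated on sample file contents so it runs standalone:

```shell
# Extract the nr_throttled counter from cpu.stat-formatted input.
nr_throttled() {
    awk '$1 == "nr_throttled" { print $2 }'
}

# Sample cpu.stat contents; on a node you would read the real file, e.g.:
#   nr_throttled < /sys/fs/cgroup/cpu/kubepods/cpu.stat
sample='nr_periods 40851
nr_throttled 11230
throttled_time 9399840419481'
printf '%s\n' "$sample" | nr_throttled    # -> 11230

# A large, growing nr_throttled relative to nr_periods indicates sustained
# CPU throttling. OOM kills show up in the kernel log:
#   dmesg -T | grep -i 'killed process'
```

Sampling these counters a few seconds apart shows whether throttling is ongoing rather than historical.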
Thanks @jdn5126 , we will try removing GuardDuty first from just this very busy cluster.
We have also started looking at our limits so that the pods don't burst way too high beyond node capacity.
I will give feedback on how it goes !
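For reference, the limits tightening mentioned above is expressed per container with requests and limits; a purely illustrative fragment (the values are made up, not a recommendation):

```yaml
# Illustrative only: requests drive scheduling decisions, while limits cap
# bursting so a single pod cannot starve the node of CPU or memory.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```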
Hi @jdn5126, we removed the GuardDuty agent, which was consuming a significant amount of resources.
It was also causing contention with mountpoint access.
Things are looking much better and we are moving towards ensuring that our nodes are not too close to full capacity resource wise.
There is a new error we are now getting after things have become more stable
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to create new parent process: namespace path: lstat /proc/0/ns/ipc: no such file or directory: unknown
This error is being addressed in open issue containerd/containerd#9160
In that issue, there is mention that this PR resolves the issue.
However, the PR is merged in containerd v1.7.7 (https://github.com/containerd/containerd/commits/v1.7.7), and yet the latest available containerd version for EKS AMIs is 1.7.2, according to https://github.com/awslabs/amazon-eks-ami/releases
Do you know when EKS AMIs will be on containerd v1.7.7 or later?
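A quick way to check whether a given node's containerd already includes the fix is a version comparison against the 1.7.7 threshold mentioned above; a small sketch assuming GNU coreutils `sort -V` (the helper name is ours):

```shell
# Succeeds when the installed version is >= the required fix version.
version_at_least() {
    required="$1"; installed="$2"
    # With version sort, the smaller of the pair comes first; if that is
    # $required, then $installed is at least as new.
    [ "$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)" = "$required" ]
}

# On a node you would feed in: containerd --version | awk '{print $3}'
if version_at_least "1.7.7" "1.7.2"; then
    echo "1.7.2: has fix"
else
    echo "1.7.2: missing fix"    # this branch is taken
fi
version_at_least "1.7.7" "1.7.11" && echo "1.7.11: has fix"
```

Plain string comparison would get this wrong (`1.7.11` sorts before `1.7.7` lexically), which is why `sort -V` is used.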
Glad to hear @edison-vflow. I reached out to the AL2 team to figure out the timeline for updating the containerd version. Will update as soon as I hear back.
@edison-vflow the Amazon Linux 2 team is in the process of bundling containerd 1.7.11, so that will be included in future AMIs. No ECD at this time, but targeting by end of the month
Just wanted to note that the Amazon Linux 2 update to containerd 1.7.11 is still in progress... I am not sure why it is taking them this long
It's finally here, with containerd 1.7.11-1!
https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240202
Thanks everyone
We're on the latest AMI, but we are still encountering sporadic errors during pod startup.
Environment:
- Kubernetes version (use `kubectl version`): v1.25.16-eks-5e0fdde
- CNI Version: v1.16.0-eksbuild.1
- OS (e.g: `cat /etc/os-release`): Amazon Linux :: AMI name => amazon-eks-node-1.25-v20240202
We'll continue investigating, as our case is probably related to the reactive load during workload scheduling, but I've also left some evidence below for anyone else having the same issues:
So now we're getting this error for some pods sporadically, with StartError state from the Kubernetes perspective:
Warning Failed 11m kubelet Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to create new parent process: namespace path: lstat /proc/0/ns/ipc: no such file or directory: unknown
Describe pod gives:
State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to create new parent process: namespace path: lstat /proc/0/ns/ipc: no such file or directory: unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 01:00:00 +0100
Finished: Mon, 05 Feb 2024 09:32:47 +0100
Other pods have this message as well:
State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: context canceled: unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 01:00:00 +0100
Finished: Mon, 05 Feb 2024 07:18:14 +0100
I'm also concerned about `Started: Thu, 01 Jan 1970 01:00:00 +0100`
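That `Started` value looks like a zero Unix timestamp rendered in a +01:00 zone: when the shim never starts the task, the reported start time is simply unset. A quick check, assuming GNU `date`:

```shell
# Epoch 0 rendered in UTC and in a +01:00 zone matches the status output,
# so the 'Started' field is an unset timestamp, not a real start time.
LC_ALL=C date -u -d @0
# -> Thu Jan  1 00:00:00 UTC 1970
TZ='Europe/Paris' LC_ALL=C date -d @0
# -> Thu Jan  1 01:00:00 CET 1970
```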
@VLZZZ can you bring this up at https://github.com/awslabs/amazon-eks-ami/issues?
@jdn5126 Thank you, I've created a standalone issue just in case: awslabs/amazon-eks-ami#1651
Closing as the containerd version has been updated and there are no active issues in this thread
This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.