aws/amazon-vpc-cni-k8s

aws-node daemonset liveness/readiness probes keep failing and daemonset fails to restart on EKS 1.27

edison-vflow opened this issue · 16 comments

What happened:

The aws-node daemonset's liveness/readiness probes keep failing and the daemonset fails to restart.

When this happens, the node keeps receiving requests to schedule pods.
Any pods that are scheduled on that node remain in the ContainerCreating state.
Attempts to restart the aws-node daemonset also fail.

What makes the failure of the aws-node daemonset critical to us is that we have about 30 microservices. If just one of them is scheduled on a node that has the issue, the entire application fails to start.

We heavily depend on spinning up CI/CD environments that run tests on our code before deployment.
All our CI/CD systems are currently broken due to these failures, compromising what we deploy to production.

We have looked at https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and none of the known issues appear to match ours.

We also upgraded the VPC CNI addon to the latest version on our clusters, which are on EKS 1.27.
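For reference, this is roughly how the managed addon version can be checked and upgraded with the AWS CLI (a sketch; the cluster name my-cluster is a placeholder):

    # Check which VPC CNI addon version the cluster is running (my-cluster is a placeholder)
    aws eks describe-addon \
      --cluster-name my-cluster \
      --addon-name vpc-cni \
      --query 'addon.addonVersion'

    # List addon versions available for the cluster's Kubernetes version
    aws eks describe-addon-versions \
      --addon-name vpc-cni \
      --kubernetes-version 1.27 \
      --query 'addons[].addonVersions[].addonVersion'

    # Upgrade to a specific version
    aws eks update-addon \
      --cluster-name my-cluster \
      --addon-name vpc-cni \
      --addon-version v1.16.0-eksbuild.1 \
      --resolve-conflicts OVERWRITE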

Attach logs

  • Logs have been collected and sent to k8s-awscni-triage@amazon.com with subject: Urgent :: GitHub issue https://github.com/aws/amazon-vpc-cni-k8s/issues/2743

What you expected to happen:

  • Pods should always be able to start and terminate on any node they are scheduled to. However, over the past week or so
    the aws-node daemonset has started to fail frequently, and when it fails its liveness/readiness probes it does not
    restart, rendering the node unusable. When a node's aws-node daemonset is failing, the scheduler nevertheless keeps
    placing workloads on it, and those containers remain stuck in the ContainerCreating state. Pods already running on that
    node also fail to terminate and remain in the Terminating state (see the sketch after this list).
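As a stopgap while investigating, a rough sketch of how an affected node can be identified and cordoned so the scheduler stops sending new pods to it (pod and node names are placeholders):

    # Find aws-node pods that are not Ready and the nodes they run on
    kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

    # Inspect the failing pod's recent probe events (placeholder pod name)
    kubectl describe pod -n kube-system <aws-node-pod-name>

    # Stop the scheduler from placing new pods on the affected node (placeholder node name)
    kubectl cordon <node-name>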

How to reproduce it (as minimally and precisely as possible):

  • The issue happens intermittently but often; we don't control which node in our EKS fleet it happens on or how
    frequently. The longest a cluster's fleet of nodes goes without any node's aws-node daemonset going down is about
    15 minutes.

Anything else we need to know?:

  • When a node's aws-node daemonset is down, running the prescribed log collection script, `sudo bash eks-log-collector.sh`, actually hangs (a fallback sketch follows the output excerpts below).

    On some nodes it runs up to point :

    Trying to collect common operating system logs...
    Trying to collect kernel logs...
    Trying to collect modinfo... Trying to collect mount points and volume information...
    
    

    and on some up to point

    Trying to collect common operating system logs...
    Trying to collect kernel logs...
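When the collector hangs, a minimal fallback is to pull the relevant logs directly on the node (a sketch; assumes SSH/SSM access to the node and the standard VPC CNI host log path):

    # Kubelet and containerd journals
    sudo journalctl -u kubelet --since "1 hour ago" > kubelet.log
    sudo journalctl -u containerd --since "1 hour ago" > containerd.log

    # ipamd logs written by the VPC CNI on the host
    sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log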
    
    

Environment:

  • Kubernetes version (use kubectl version): 1.27
  • CNI Version : v1.16.0-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon linux :: AMI name => amazon-eks-node-1.27-v20231002
  • Kernel (e.g. uname -a):

Looking at logs

Debugged via k8s Slack and concluded that the VPC CNI is not at issue here. The aws-node pod, along with other pods, is failing the readiness probe check due to what appears to be I/O starvation. Other already running pods on these affected nodes experience the same issue.

There is a lot of CPU throttling, I/O throttling, and OOMs happening on this node, and these events are the likely causes of the containerd/kubelet errors.
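For anyone hitting the same symptoms, a rough sketch of checks that can confirm this kind of resource starvation (node name is a placeholder; kubectl top requires metrics-server, iostat requires the sysstat package):

    # On the node (via SSH or SSM)
    dmesg -T | grep -i -E "out of memory|oom"   # recent OOM kills
    iostat -x 5 3                               # per-device I/O utilization

    # From a workstation with cluster access (placeholder node name)
    kubectl top node <node-name>                # node-level CPU/memory usage
    kubectl describe node <node-name>           # check Conditions for MemoryPressure / DiskPressure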

Thanks @jdn5126, we will try removing GuardDuty first from just this very busy cluster.
We have also started reviewing our resource requests and limits so that pods don't burst far beyond node capacity.
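As an illustration of what reviewing the limits looks like in practice, a sketch (the deployment name my-service, the node name, and the values are placeholders):

    # Tighten requests/limits on a deployment so a single workload cannot burst far beyond its share of the node
    kubectl set resources deployment my-service \
      --requests=cpu=250m,memory=256Mi \
      --limits=cpu=500m,memory=512Mi

    # Compare total scheduled requests/limits against node allocatable capacity
    kubectl describe node <node-name> | grep -A 10 "Allocated resources"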

I will give feedback on how it goes!

Hi @jdn5126, we removed the GuardDuty agent, which was consuming a significant amount of resources
and was also causing contention on mount point access.
Things are looking much better, and we are working on ensuring that our nodes do not run too close to full resource capacity.
There is a new error we are now getting after things have become more stable:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to create new parent process: namespace path: lstat /proc/0/ns/ipc: no such file or directory: unknown

This error is being addressed in open issue containerd/containerd#9160.
In that issue, there is mention that this PR resolves the issue.
However, the PR is merged in containerd v1.7.7 (https://github.com/containerd/containerd/commits/v1.7.7), yet the latest containerd version available in EKS AMIs is 1.7.2, according to https://github.com/awslabs/amazon-eks-ami/releases.
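In the meantime, a quick way to confirm which containerd version the nodes are actually running (a sketch):

    # The CONTAINER-RUNTIME column shows e.g. containerd://1.7.2
    kubectl get nodes -o wide

    # Or directly on a node
    containerd --version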

Do you know when EKS AMIs will be on containerd v1.7.7 or later?

Glad to hear it, @edison-vflow. I reached out to the AL2 team to figure out the timeline for updating the containerd version. Will update as soon as I hear back.

@edison-vflow the Amazon Linux 2 team is in the process of bundling containerd 1.7.11, so that will be included in future AMIs. No ECD at this time, but targeting by end of the month

Just wanted to note that the Amazon Linux 2 update to containerd 1.7.11 is still in progress... I am not sure why it is taking them this long

It's finally here, with containerd 1.7.11-1!
https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240202

Thanks everyone

We're on the latest AMI, but we are still encountering sporadic errors during pod startup.
Environment:

  • Kubernetes version (use kubectl version): v1.25.16-eks-5e0fdde
  • CNI Version : v1.16.0-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon linux :: AMI name => amazon-eks-node-1.25-v20240202

We'll continue investigating, as our case is probably related to the reactive load during workload scheduling, but I'm also leaving some evidence below for anyone else having the same issues:


So now we're getting this error for some pods sporadically, with a StartError state from the Kubernetes perspective:

  Warning  Failed     11m                kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to create new parent process: namespace path: lstat /proc/0/ns/ipc: no such file or directory: unknown

Describing the pod gives:

    State:          Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to create new parent process: namespace path: lstat /proc/0/ns/ipc: no such file or directory: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 01:00:00 +0100
      Finished:     Mon, 05 Feb 2024 09:32:47 +0100

Other pods have this message as well:

    State:          Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: context canceled: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 01:00:00 +0100
      Finished:     Mon, 05 Feb 2024 07:18:14 +0100

I'm also concerned about Started: Thu, 01 Jan 1970 01:00:00 +0100
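The 1970 timestamp is most likely just the zero (Unix epoch) value reported because the container never actually started. For anyone else triaging this, a hedged sketch of listing pods that hit StartError across the cluster (assumes jq is installed):

    # List namespace/name of pods whose current or last container state terminated with reason StartError
    kubectl get pods -A -o json | jq -r '
      .items[]
      | select([.status.containerStatuses[]?
                | .state.terminated.reason, .lastState.terminated.reason]
               | any(. == "StartError"))
      | .metadata.namespace + "/" + .metadata.name'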

@VLZZZ @jdn5126, thanks for the feedback, and good news that the new containerd 1.7.11-1 is out.
We will test and give feedback.

@jdn5126 Thank you, I've created standalone issue just in case
awslabs/amazon-eks-ami#1651

Closing as the containerd version has been updated and there are no active issues in this thread

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.