awslabs/amazon-eks-ami

kubelet doesn't restart if crashed

alexku7 opened this issue · 4 comments

What happened:
If the kubelet crashes, the node goes to NotReady status. The kubelet never restarts. The pods on the affected node are stuck in Terminating status forever.

What you expected to happen:
The kubelet should restart (by the way, this is what happens on Azure AKS and Google GKE).
Alternatively: the pods should be evicted forcefully and moved to another healthy node. This is especially critical for StatefulSets.
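
One manual workaround in the meantime (not a fix) is to force-delete the pods stuck in Terminating so that their controller reschedules them elsewhere; the node, pod, and namespace names below are placeholders:

> kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
> kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force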

How to reproduce it (as minimally and precisely as possible):
Simply open a shell on the node and kill the kubelet with the kill command.
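
A minimal sketch of that step (pidof is just one way to look up the process ID):

> sudo kill $(pidof kubelet)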

Anything else we need to know?:
Tested on Amazon Linux 2 and Amazon Linux 2023 nodes. Latest AMI, 1.29.3-20240506.

Simply open a shell on the node and kill the kubelet with the kill command.

That isn't simulating a crash; it sends a SIGTERM to the process, which kubelet traps so that it terminates normally (with an exit code of zero). If you send a SIGKILL (kill -9), you get the auto-restart you'd expect, which is what happens when kubelet exits unexpectedly due to a panic, etc.:

> systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; preset: disabled)
     Active: active (running) since Fri 2024-05-10 18:10:35 UTC; 2min 7s ago
       Docs: https://github.com/kubernetes/kubernetes
    Process: 2744679 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5 (code=exited, status=0/SUCCESS)
   Main PID: 2744680 (kubelet)
      Tasks: 12 (limit: 18811)
     Memory: 29.3M
        CPU: 1.407s
     CGroup: /runtime.slice/kubelet.service
             └─2744680 /usr/bin/kubelet --node-ip=192.168.174.139 --cloud-provider=external --hostname-override=ip-192-168-174-139.us-west-2.compute.internal --config=/etc/kubernetes/kube>

> sudo kill -9 2744680

> systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; preset: disabled)
     Active: activating (auto-restart) (Result: signal) since Fri 2024-05-10 18:13:02 UTC; 1s ago
       Docs: https://github.com/kubernetes/kubernetes
    Process: 2744679 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5 (code=exited, status=0/SUCCESS)
    Process: 2744680 ExecStart=/usr/bin/kubelet $NODEADM_KUBELET_ARGS (code=killed, signal=KILL)
   Main PID: 2744680 (code=killed, signal=KILL)
        CPU: 1.526s
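
For context, whether systemd restarts the unit after a given exit is governed by its Restart= setting, which you can inspect on the node. The drop-in below is only an illustrative sketch of a policy consistent with the behavior above (restart after signals and non-zero exits, but not after a clean exit); it is not necessarily what the AMI actually ships:

> systemctl show kubelet --property=Restart,RestartUSec

# hypothetical drop-in, e.g. /etc/systemd/system/kubelet.service.d/10-restart.conf
[Service]
Restart=on-failure
RestartSec=5

> sudo systemctl daemon-reload
> sudo systemctl restart kubelet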

I will try, but it doesn't matter that much.
We experienced an issue (twice) where the kubelet crashed because of memory pressure and never recovered.
As a result, the pods were stuck forever. The only way to recover was to delete the EC2 node manually in the AWS console.
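
For reference, that manual recovery can also be done from the CLI instead of the console; the node name and instance ID are placeholders:

> kubectl delete node <node-name>
> aws ec2 terminate-instances --instance-ids <instance-id>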

GKE and AKS implement a simple kubelet watchdog, and the kubelet restarts immediately.

If kubelet crashes, it will be restarted by systemd; see my example above. It sounds like you're running into problems with kubeReserved memory being insufficient, which is a legitimate issue. That's discussed in #1141 and #1145.
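
If kubelet is being starved or OOM-killed due to insufficient reservation, raising kubeReserved is the usual lever. A hedged sketch for an AL2 self-managed node: the cluster name and reservation values are placeholders, and the config path may differ on AL2023:

> grep -A4 kubeReserved /etc/kubernetes/kubelet/kubelet-config.json
> /etc/eks/bootstrap.sh <cluster-name> --kubelet-extra-args '--kube-reserved=memory=1Gi,cpu=250m'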

Thanks
I think this is what we actually had
#1145