kubelet doesn't restart if crashed
alexku7 opened this issue · 4 comments
What happened:
If the kubelet crashes, the node goes to NotReady status and the kubelet never restarts. The pods on the affected node are stuck in Terminating status forever.
What you expected to happen:
The kubelet should restart (this is what happens on Azure AKS and Google GKE).
Alternatively: the pods should be evicted forcefully and moved to another healthy node. This is especially critical for StatefulSets.
How to reproduce it (as minimally and precisely as possible):
Simply open a shell on the node and kill the kubelet with the kill command.
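A minimal reproduction sketch, assuming the kubelet binary is /usr/bin/kubelet and runs under a systemd unit named kubelet.service (paths and unit name may differ on your AMI):

# Find the kubelet's PID and send it a plain SIGTERM (what "kill" does by default).
sudo kill "$(pidof kubelet)"
# To simulate an actual crash instead, send SIGKILL:
# sudo kill -9 "$(pidof kubelet)"
# Then check whether systemd restarted the service:
systemctl status kubelet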
Anything else we need to know?:
Tested on Amazon Linux 2 and Amazon Linux 2023 nodes, latest AMI 1.29.3-20240506.
> Simply open a shell on the node and kill the kubelet with the kill command.
That isn't simulating a crash; that sends a SIGTERM to the process. The kubelet traps the signal and terminates normally (with an exit code of zero). If you send a SIGKILL (kill -9), you get the auto-restart you'd expect, which is what happens when the kubelet exits unexpectedly due to a panic, etc.:
> systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; preset: disabled)
Active: active (running) since Fri 2024-05-10 18:10:35 UTC; 2min 7s ago
Docs: https://github.com/kubernetes/kubernetes
Process: 2744679 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5 (code=exited, status=0/SUCCESS)
Main PID: 2744680 (kubelet)
Tasks: 12 (limit: 18811)
Memory: 29.3M
CPU: 1.407s
CGroup: /runtime.slice/kubelet.service
└─2744680 /usr/bin/kubelet --node-ip=192.168.174.139 --cloud-provider=external --hostname-override=ip-192-168-174-139.us-west-2.compute.internal --config=/etc/kubernetes/kube>
> sudo kill -9 2744680
> systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; preset: disabled)
Active: activating (auto-restart) (Result: signal) since Fri 2024-05-10 18:13:02 UTC; 1s ago
Docs: https://github.com/kubernetes/kubernetes
Process: 2744679 ExecStartPre=/sbin/iptables -P FORWARD ACCEPT -w 5 (code=exited, status=0/SUCCESS)
Process: 2744680 ExecStart=/usr/bin/kubelet $NODEADM_KUBELET_ARGS (code=killed, signal=KILL)
Main PID: 2744680 (code=killed, signal=KILL)
CPU: 1.526s
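Whether systemd restarts the kubelet after a clean exit depends on the unit's Restart= policy. As a sketch (not the AMI's actual configuration), a drop-in like the following would make systemd restart the kubelet on any exit, clean or not; verify it against the unit file shipped on your node before relying on it:

sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/10-restart-always.conf
# Hypothetical drop-in: restart kubelet on any exit, with a short back-off.
[Service]
Restart=always
RestartSec=5
EOF
sudo systemctl daemon-reload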
I will try, but it doesn't really matter.
We experienced an issue (twice) where the kubelet crashed because of memory pressure and never recovered.
As a result, the pods were stuck forever. The only way to recover was to delete the EC2 node manually in the AWS console.
GKE and AKS implement a simple kubelet watchdog, and the kubelet restarts immediately.
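As a rough illustration of such a watchdog (not how GKE or AKS actually implement it), a small script run from cron or a systemd timer could poll the kubelet's local healthz endpoint, which listens on port 10248 by default, and restart the unit when it stops responding. The script name and the restart logic here are made up for the sketch:

#!/usr/bin/env bash
# kubelet-watchdog.sh (hypothetical): restart kubelet if its healthz endpoint is unresponsive.
set -euo pipefail
if ! curl --silent --fail --max-time 5 http://127.0.0.1:10248/healthz >/dev/null; then
    echo "kubelet healthz check failed, restarting kubelet.service"
    systemctl restart kubelet.service
fi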