banzaicloud/koperator

Terminating and stuck broker pod cannot be deleted by Koperator

bartam1 opened this issue · 0 comments

Describe the bug
On a GKE v1.22 Kubernetes cluster, I have seen several times that when a Kafka cluster has been up for days, some of the broker pods get stuck in the Terminating state, e.g.:

NAME            READY   STATUS        RESTARTS   AGE     NODE
kafka-0-czldf   0/1     Terminating   0          3h36m   pool1-12efb0bb-7nxy

NAME                  STATUS   ROLES    AGE     VERSION
pool1-12efb0bb-6qbg   Ready    <none>   3h36m   v1.22.12-gke.500

The kafka-0-czldf pod has an age of 3h36m, the same as the pool1-12efb0bb-6qbg node.
I think the problem is caused by a node going down: when a new node comes up later, the broker pod remains stuck.

In the broker pod I can see terminated containers:

State: Terminated
Reason: Completed
Exit Code: 0

The broker pod's deletionTimestamp is "2022-09-25T20:56:10Z".
Because the broker pod is stuck and cannot be deleted, Koperator cannot recreate it, so the Kafka cluster does not work properly.

Steps to reproduce the issue:
I'm going to try to reproduce the issue by hand.

Expected behavior
Koperator checks the broker pods, and when a pod has a terminated container, Koperator deletes the pod and recreates it.