banzaicloud/koperator

External NodePort services are deleted and recreated every few seconds

daisywang-ca opened this issue · 6 comments

Describe the bug
We are using koperator 0.22.0. After installing a Kafka cluster with two brokers, the kafka-cluster is stuck in the ClusterReconciling state:

NAME            CLUSTER STATE        CLUSTER ALERT COUNT   LAST SUCCESSFUL UPGRADE   UPGRADE ERROR COUNT   AGE
kafka-cluster   ClusterReconciling   0                                               0                     3d19h

The two NodePort services are also recreated every few seconds (note their age of 8s):

NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                 AGE
kafka-cluster-0-external          NodePort    10.233.5.235    <none>        9094:30091/TCP                          8s
kafka-cluster-1-external          NodePort    10.233.20.3     <none>        9094:30092/TCP                          8s
kafka-cluster-cruisecontrol-svc   ClusterIP   10.233.35.58    <none>        8090/TCP,9020/TCP                       3d19h
kafka-cluster-headless            ClusterIP   None            <none>        29092/TCP,29093/TCP,9094/TCP,9020/TCP   3d19h
kafka-operator-alertmanager       ClusterIP   10.233.11.110   <none>        9001/TCP                                3d19h
kafka-operator-authproxy          ClusterIP   10.233.63.33    <none>        8443/TCP                                3d19h
kafka-operator-operator           ClusterIP   10.233.10.195   <none>        443/TCP                                 3d19h

As a result, clients are constantly disconnected from the brokers.

Steps to reproduce the issue
Installed the kafka-operator and kafka-cluster with version 0.22.0

Additional context
We suspect it's caused by 28a1168: the two services are deleted and recreated on every reconcile loop.

@daisywang-ca thanks for reporting the issue and we will look into it

@daisywang-ca I think your suspicion is correct. This bug is caused by the deleteNonHeadlessServices function, which was introduced by the commit you linked.

I'm encountering the same issue here. With my previous installation using version v0.21.2, NodePorts worked properly. After upgrading, the NodePort services for each broker are being deleted and recreated every few seconds, causing instability in the cluster. I tried a fresh installation and still faced the same issue.

Thanks for confirming the issue, @fquinino. We will try to fix the issue ASAP and drop a patch release

BTW, @daisywang-ca @fquinino would you like to join our Slack channel, where we can better communicate about issues like this (and have some fun)?

@daisywang-ca @fquinino v0.23.1 includes the fix for this bug. Please upgrade your operator version accordingly.