External NodePort services are deleted and recreated every few seconds
daisywang-ca opened this issue · 6 comments
Describe the bug
We are using koperator 0.22.0. After installing a Kafka cluster with two brokers, the kafka-cluster resource is stuck in the ClusterReconciling state:
NAME            CLUSTER STATE        CLUSTER ALERT COUNT   LAST SUCCESSFUL UPGRADE   UPGRADE ERROR COUNT   AGE
kafka-cluster   ClusterReconciling   0                                               0                     3d19h
The two NodePort services are also recreated every few seconds (note their 8s age):
NAME                              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                 AGE
kafka-cluster-0-external          NodePort    10.233.5.235   <none>        9094:30091/TCP                          8s
kafka-cluster-1-external          NodePort    10.233.20.3    <none>        9094:30092/TCP                          8s
kafka-cluster-cruisecontrol-svc   ClusterIP   10.233.35.58   <none>        8090/TCP,9020/TCP                       3d19h
kafka-cluster-headless            ClusterIP   None           <none>        29092/TCP,29093/TCP,9094/TCP,9020/TCP   3d19h
kafka-operator-alertmanager       ClusterIP   10.233.11.110  <none>        9001/TCP                                3d19h
kafka-operator-authproxy          ClusterIP   10.233.63.33   <none>        8443/TCP                                3d19h
kafka-operator-operator           ClusterIP   10.233.10.195  <none>        443/TCP                                 3d19h
As a result, clients are constantly disconnected from the brokers.
Steps to reproduce the issue
Install the kafka-operator and a kafka-cluster at version 0.22.0.
Additional context
We suspect this is caused by 28a1168: the two services are deleted and recreated on every reconcile loop.
@daisywang-ca thanks for reporting the issue and we will look into it
@daisywang-ca I think what you suspected is correct. This bug is caused by the deleteNonHeadlessServices function, which was introduced by the commit you linked.
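To illustrate the general failure mode (this is not koperator's actual code, and I have not reproduced its logic here), the sketch below shows how a cleanup step that deletes every non-headless service missing from an "expected" set will delete the per-broker external services on each pass if they are left out of that set. The next reconcile then recreates them, and the loop repeats. The name `staleServices`, the `-headless` suffix check, and the expected-set approach are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// staleServices returns, in sorted order, the existing services that a
// cleanup step would delete: every service that is not headless and is
// not present in the expected set. Hypothetical logic for illustration.
func staleServices(existing []string, expected map[string]bool) []string {
	var stale []string
	for _, name := range existing {
		if strings.HasSuffix(name, "-headless") {
			continue // headless services are never touched by this cleanup
		}
		if !expected[name] {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	existing := []string{
		"kafka-cluster-headless",
		"kafka-cluster-cruisecontrol-svc",
		"kafka-cluster-0-external",
		"kafka-cluster-1-external",
	}

	// Expected set that forgets the per-broker NodePort services:
	// both external services are reported as stale and would be deleted.
	incomplete := map[string]bool{"kafka-cluster-cruisecontrol-svc": true}
	fmt.Println(staleServices(existing, incomplete))

	// Expected set that includes them: nothing is deleted.
	complete := map[string]bool{
		"kafka-cluster-cruisecontrol-svc": true,
		"kafka-cluster-0-external":        true,
		"kafka-cluster-1-external":        true,
	}
	fmt.Println(len(staleServices(existing, complete)))
}
```

Whatever the real cause inside the function, the observable symptom matches this pattern: only the external NodePort services keep a very low age.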
I'm encountering the same issue here. With my previous installation on v0.21.2, NodePort services worked properly. After upgrading, the NodePort service for each broker is deleted and recreated every few seconds, causing instability in the cluster. I tried a fresh installation and still hit the same issue.
Thanks for confirming the issue, @fquinino. We will try to fix the issue ASAP and drop a patch release
BTW, @daisywang-ca @fquinino would you like to join our Slack channel, where we can better communicate about issues like this (and have some fun)?
@daisywang-ca @fquinino v0.23.1 has the bug fix for this, please upgrade the operator versions accordingly