strimzi/strimzi-canary

Configure canary to not do dynamic reassignment during periodic reconcile

ppatierno opened this issue · 0 comments

I stumbled into a weird scenario during a Strimzi upgrade, which drives rolling updates of the ZooKeeper nodes and Kafka brokers to newer image versions while the canary is running against the cluster.
What I noticed is that the last rolled Kafka broker sits at 100% CPU usage for a long time (even more than a couple of hours) and then returns to normal usage later on its own (with no intervention).
After digging into the logs and some offline chats with @tombentley (who can provide more details), it seems that a replica fetcher thread is looping and consuming the CPU. That may be due to a Kafka bug, but what triggers it is the canary doing partition reassignment while the overall cluster is rolling.
Of course, the canary tool doesn't know that Kafka brokers are disappearing and reappearing because of a rolling update; it just sees the cluster scaling down and up again, so it applies its partition reassignment logic at each of these steps.

In order to mitigate this problem (until the root cause is identified and fixed in Kafka), it could be possible to configure the canary with the "expected" number of brokers, so that it avoids doing the reassignment during periodic reconcile; this is kind of the opposite of the current behavior, which is more dynamic and reacts to scaling up/down.
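Just to make the idea concrete, here is a minimal sketch (in Go, since the canary is written in Go) of how such an option could be read; the `EXPECTED_CLUSTER_SIZE` name and the `-1` sentinel meaning "keep today's dynamic behavior" are only assumptions for illustration, not an existing option.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Hypothetical default: -1 keeps today's dynamic behavior where the canary
// follows whatever broker count it currently observes.
const defaultExpectedClusterSize = -1

// expectedClusterSize reads the (hypothetical) EXPECTED_CLUSTER_SIZE env var
// and falls back to the dynamic default when it is unset or invalid.
func expectedClusterSize() int {
	v, ok := os.LookupEnv("EXPECTED_CLUSTER_SIZE")
	if !ok {
		return defaultExpectedClusterSize
	}
	size, err := strconv.Atoi(v)
	if err != nil || size <= 0 {
		fmt.Printf("ignoring invalid EXPECTED_CLUSTER_SIZE %q, falling back to dynamic mode\n", v)
		return defaultExpectedClusterSize
	}
	return size
}

func main() {
	fmt.Println("expected cluster size:", expectedClusterSize())
}
```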

On startup the canary could create the topic once the expected number of brokers matches what it sees on the cluster, and even do a reassignment if one is needed after a scale down/up; in any case this would happen only at startup, not during periodic reconcile.
Of course, this means that if the cluster is scaled up or down, the canary tool needs to be restarted.
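And a rough sketch of how the startup wait and the changed reconcile could look; the `brokerCounter` interface, function names, and timings are hypothetical and only illustrate the "reassign at startup, never during reconcile" idea under the assumption above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// brokerCounter abstracts whatever admin-client call the canary already uses
// to count the brokers currently registered in the cluster (hypothetical interface).
type brokerCounter interface {
	CurrentBrokers() (int, error)
}

// waitForExpectedBrokers blocks at startup until the cluster reports exactly
// the configured number of brokers, so that topic creation (and a one-off
// reassignment, if needed) happens against a stable cluster.
func waitForExpectedBrokers(bc brokerCounter, expected int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if current, err := bc.CurrentBrokers(); err == nil && current == expected {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return errors.New("expected broker count not reached before timeout")
}

// reconcile shows the behavioral change: with an expected size configured,
// the periodic reconcile never triggers a reassignment, even if the observed
// broker count temporarily differs (e.g. during a rolling update).
func reconcile(bc brokerCounter, expected int) {
	current, err := bc.CurrentBrokers()
	if err != nil {
		fmt.Println("could not count brokers:", err)
		return
	}
	if current != expected {
		fmt.Printf("observed %d brokers instead of %d: assuming a rolling update, skipping reassignment\n", current, expected)
		return
	}
	// ...existing periodic checks (produce/consume/latency) would run here...
}

// fixedBrokers is a stub implementation used only to make the sketch runnable.
type fixedBrokers int

func (f fixedBrokers) CurrentBrokers() (int, error) { return int(f), nil }

func main() {
	cluster := fixedBrokers(3)
	if err := waitForExpectedBrokers(cluster, 3, time.Minute); err != nil {
		fmt.Println("startup failed:", err)
		return
	}
	reconcile(cluster, 3)
}
```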