Rebalance logic breaks under high consumer startup load (ZK 110 node exists, LeaderNotFound, or BadOffset)
BadLambdaJamma opened this issue · 0 comments
Rebalance errors at consumer startup have always been common. The best guidance was to keep restarting the consumer until a rebalance succeeded across all consumers. In the past this always worked for me. I typically had three machines with one consumer worker process each. They would restart a few times before settling into an N-N consumer-to-partition ratio and then hum along nicely. In a new project I am starting 90 consumers effectively concurrently. In this mode, the rebalance logic exhibits some VERY weird behaviors that go beyond failing to start, or starting but tearing down a ZK node that another consumer was using.
I can see that during a rebalance, the client has de-registered from ZooKeeper notifications about consumer changes (this makes sense). But under high startup load, many consumer clients can call register() to register themselves in ZK as they come online, while many other clients have their backs turned to ZooKeeper because they are already rebalancing. Mayhem ensues as the various clients try to stake their claims in the rebalance. Some claim the same znode and error out when they try to create it after another consumer has already done so. Some consumers will tear down existing consumers if the leader on the partitions znode was the loser in the rebalance war. Sometimes entire groups of consumers incorrectly decide that they are the entire consumer group (because the other consumers are still rebalancing) and start reading from offset 0.
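To make the overlap concrete, here is a minimal sketch of how a stale membership snapshot leads to contested partitions. It assumes a range-style assignment (sorted partitions split evenly across sorted consumer ids), which may not be exactly what this client does; `assign`, `contested_partitions`, and the consumer ids are hypothetical names used only for illustration.

```python
from collections import Counter


def assign(partitions, members, me):
    """Range-style assignment (an assumption): each member gets a contiguous slice."""
    members = sorted(members)
    partitions = sorted(partitions)
    idx = members.index(me)
    per, extra = divmod(len(partitions), len(members))
    start = idx * per + min(idx, extra)
    count = per + (1 if idx < extra else 0)
    return partitions[start:start + count]


def contested_partitions(claims):
    """Return the partitions that more than one consumer believes it owns."""
    counts = Counter(p for owned in claims.values() for p in owned)
    return sorted(p for p, n in counts.items() if n > 1)


partitions = list(range(6))

# c1 read the member list before c2 and c3 registered, then stopped watching.
stale_view = ["c1"]
# c2 and c3 read the member list after everyone had registered.
fresh_view = ["c1", "c2", "c3"]

claims = {
    "c1": assign(partitions, stale_view, "c1"),
    "c2": assign(partitions, fresh_view, "c2"),
    "c3": assign(partitions, fresh_view, "c3"),
}

print(claims)
print(contested_partitions(claims))
```

Here c1 thinks it is the whole group and claims every partition, so each partition that c2 and c3 claim is claimed twice. With 90 consumers starting at once, many different snapshots can be in flight at the same time.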
It seems the solution is to serialize start/rebalance operations across all consumers. Rule: a new consumer cannot start (and hence call register() and trigger a rebalance) while a rebalance is already underway. We could also stay registered for consumer changes in ZK during a rebalance, and invalidate the current rebalance if a new consumer joins before it completes.
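A rough sketch of the second idea (invalidate a rebalance that was sullied by a new join) might look like the following. This is not the client's code: `GroupState` is a hypothetical in-memory stand-in for the group's znodes, the generation counter is my own assumption about how a join could be detected, and `assign` assumes range-style assignment. The `before_commit` hook exists only so the example can simulate a consumer joining mid-rebalance.

```python
import threading


def assign(partitions, members, me):
    """Range-style assignment (an assumption): each member gets a contiguous slice."""
    members = sorted(members)
    idx = members.index(me)
    per, extra = divmod(len(partitions), len(members))
    start = idx * per + min(idx, extra)
    return partitions[start:start + per + (1 if idx < extra else 0)]


class GroupState:
    """Hypothetical in-memory stand-in for the consumer group's znodes."""

    def __init__(self, partitions):
        self._lock = threading.Lock()
        self.partitions = sorted(partitions)
        self.members = set()
        self.generation = 0   # bumped on every join
        self.owners = {}      # partition -> consumer id, valid for current generation

    def register(self, consumer_id):
        with self._lock:
            self.members.add(consumer_id)
            self.generation += 1
            self.owners.clear()  # claims made under the old membership are void

    def snapshot(self):
        with self._lock:
            return self.generation, sorted(self.members)

    def commit(self, generation, consumer_id, claimed):
        """Accept the claim only if no consumer joined since the snapshot."""
        with self._lock:
            if generation != self.generation:
                return False
            for p in claimed:
                self.owners[p] = consumer_id
            return True


def rebalance(state, consumer_id, max_attempts=10, before_commit=None):
    for attempt in range(1, max_attempts + 1):
        generation, members = state.snapshot()
        claimed = assign(state.partitions, members, consumer_id)
        if before_commit is not None:
            before_commit(attempt)  # test hook: simulate activity mid-rebalance
        if state.commit(generation, consumer_id, claimed):
            return attempt, claimed
    raise RuntimeError("rebalance did not settle after %d attempts" % max_attempts)


state = GroupState(range(4))
state.register("c1")

def join_c2_once(attempt):
    if attempt == 1:
        state.register("c2")  # c2 comes online while c1 is mid-rebalance

print(rebalance(state, "c1", before_commit=join_c2_once))
print(rebalance(state, "c2"))
print(state.owners)
```

c1's first attempt is rejected because c2 joined after the snapshot, so c1 recomputes against the full member list instead of claiming every partition. The stricter rule (no new consumer may register while a rebalance is underway) would instead need a group-wide lock held across register and rebalance, which this sketch does not attempt.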