jet/kafunk

Consumer does not recover from leaderless partition

Closed this issue · 2 comments

The following behavior was observed when running Kafunk version 0.1.8.

We have a consumer group containing a single process that consumes a single topic (several other consumer groups exhibited the issue described here, but this was the simplest example). We experienced a partial Kafka cluster outage during which the leader broker of partition 3 (node_id=106) died. During the outage, Kafunk detected that partition 3 was leaderless and did not attempt to consume it, but when the broker became healthy again, our consumer did not notice.

The attached log shows a period of time during which the consumer is assigned partition 3, then restarts a few times, and in the last ~50 lines we see the warning about the leaderless partition. Afterward it goes into a normal fetch/consume/commit offsets loop, but completely ignores partition 3. Several hours later we manually restarted the process to force Kafunk to notice that partition 3 now had a healthy leader.

We would have expected a metadata update from the cluster to inform the Kafunk consumer that it now had a leader for partition 3, but this did not happen.

Consumer log: kafunk_consumer.log

Thanks for reporting, I'll take a look.

Should be addressed in #213. It looks like it wasn't explicitly removing the affected partitions from the metadata view.
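
For anyone following along, here is a rough sketch of why dropping the affected partitions from the cached view matters. It is written in Python rather than Kafunk's F#, and every name in it (`MetadataView`, `fetch_leaders`, `invalidate`, `leader_for`) is hypothetical. It illustrates the general idea only and is not the code from #213.

```python
from typing import Callable, Dict, Optional


class MetadataView:
    """Hypothetical cache of partition -> leader node id for one topic."""

    def __init__(self, fetch_leaders: Callable[[], Dict[int, Optional[int]]]):
        # fetch_leaders stands in for a metadata request to the cluster;
        # it returns None as the leader of a leaderless partition.
        self._fetch_leaders = fetch_leaders
        self._leaders: Dict[int, int] = {}

    def refresh(self) -> None:
        # Keep only the partitions that currently report a leader.
        latest = self._fetch_leaders()
        self._leaders = {p: n for p, n in latest.items() if n is not None}

    def invalidate(self, partition: int) -> None:
        # Explicitly drop the entry so the next lookup has to refresh.
        self._leaders.pop(partition, None)

    def leader_for(self, partition: int) -> Optional[int]:
        # A missing entry triggers a refresh; a cached entry is trusted as-is.
        if partition not in self._leaders:
            self.refresh()
        return self._leaders.get(partition)
```

In this sketch, a partition that was invalidated while leaderless is looked up again on the next call to `leader_for`, so a recovered leader is picked up without restarting the process. If the entry is never removed, the cached view keeps being trusted and the change goes unnoticed.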