wvanbergen/kafka

some partitions have no owner

Huang-lin opened this issue · 4 comments

Hi,
I've found a problem: after I restart my consumers, a partition is left with no owner.

The consumer group triggers a rebalance when I stop a consumer. The timeout in finalizePartition is cg.config.Offsets.ProcessingTimeout, and the retry time during rebalancing is also cg.config.Offsets.ProcessingTimeout.

Before a consumer stops, it has to finish whatever processing is in progress and then run finalizePartition. This can take longer than cg.config.Offsets.ProcessingTimeout, in which case rebalancing may fail and some partitions are left with no owner.

To solve this, maybe we could add a goroutine that watches each partition's owner. That would also help when the number of partitions changes, for example when Kafka broker capacity is expanded.
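A rough sketch of what such a watcher could look like, assuming we can get a snapshot of all partitions and their current owners. `unownedPartitions`, `watchOwners`, `snapshot` and `rebalance` are all hypothetical names for illustration, not existing functions in this library.

```go
package main

import (
	"context"
	"time"
)

// unownedPartitions returns the partitions that have no owner recorded.
// owners maps partition ID to consumer ID; a missing or empty entry is
// treated as "no owner".
func unownedPartitions(all []int32, owners map[int32]string) []int32 {
	var missing []int32
	for _, p := range all {
		if owners[p] == "" {
			missing = append(missing, p)
		}
	}
	return missing
}

// watchOwners polls snapshot every interval and calls rebalance whenever at
// least one partition has no owner. It returns when ctx is cancelled.
// interval must be positive, otherwise time.NewTicker panics.
func watchOwners(ctx context.Context, interval time.Duration,
	snapshot func() ([]int32, map[int32]string), rebalance func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			all, owners := snapshot()
			if len(unownedPartitions(all, owners)) > 0 {
				rebalance()
			}
		}
	}
}
```

It would be started with something like `go watchOwners(ctx, 10*time.Second, snapshot, rebalance)`. Because the snapshot lists all partitions on every tick, a change in the partition count would show up as unowned partitions too.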

@wvanbergen I have the same trouble.

In my experience, the most common reason for partitions not being consumed is locks still being present in zookeeper because a previous consumer was not shut down properly. After the zookeeper connection timeout, these locks should be garbage collected and the consumergroup should pick up the partitions. However, it's possible there are bugs in this.

It's unlikely that I'll be working on this myself any time soon. I am happy to accept patches for this though.

I think watching the partition owner is difficult, because we don't know when a consumer has finished registering itself as owner in zookeeper. A better approach seems to be to trigger a rebalance again after a partition claim fails. I have finished a patch for this and created a pull request.

There are now three PRs open which attempt to address roughly the same problem with varying amounts of change:
#88
#101
#103

@wvanbergen At least in my experience with this particular issue, I've seen two race conditions. I've documented them in the comment on #101.