Potential bug in pause/resume logic

Question

Potential bug in pause/resume logic

rafer opened this issue 2 years ago · 1 comments

Over the last several weeks I've experienced several occurrences where an individual partition has stopped processing for hours at a time. I think I've tracked this down to a bug in the pause/resume logic in racecar. Here's a breakdown of what I believe is happening:

Consumer A pauses a partition because it encounters processing exceptions. Our consumers frequently encounter a few errors during rebalances due to duplicate processing of messages.
The partition is assigned to another consumer (due to a rebalance)
The pause for that partition expires (we have the pause timeout set to 10s) and we attempt to resume the partition on consumer A, but since it's no longer assigned we don't call resume for that partition on the underlying rdkafka consumer.
The partition returns to consumer A, but we don't process it because we didn't ever call resume on the underlying rdkafka client. That partition is effectively paused on consumer A indefinitely.

When the next deploy/rebalance happens, the partition gets assigned to a new consumer and the backlog is processed.

Note: I'm running zendesk/racecar 2.8.2.

Here are some annotated logs from our consumers that make me suspect the above scenario. These logs are all from a single consumer pod.

# Client is assigned partition 9, starts processing, hits an exception and pauses
# https://github.com/zendesk/racecar/blob/v2.8.2/lib/racecar/runner.rb#L273
[2023-04-21T18:43:49.739109 #1]  WARN -- : Pausing partition our-topic/9 for 10.0 seconds

# Partition 9 is assigned to another consumer because partitions are moving around as we rebalance

# The 10 second partition pause expires and we attempt to resume
# https://github.com/zendesk/racecar/blob/v2.8.2/lib/racecar/runner.rb#L293
[2023-04-21T18:43:59.965817 #1]  INFO -- : Automatically resuming partition our-topic/9, pause timeout expired

# But we don't call resume because we no longer own that partition
# https://github.com/zendesk/racecar/blob/v2.8.2/lib/racecar/consumer_set.rb#L103
[2023-04-21T18:43:59.966079 #1]  INFO -- : Attempted to resume our-topic/9, but we're not subscribed to it

# The partition is assigned back to the consumer but it does not resume processing

I've worked around this issue by using a forked version that calls resume for the topic/partition on all of the clients, without checking for assignment. This appears to have fixed the issue. I'm not sure if this particular fix is something that should be merged because of the downside described in the PR body. I'm happy to open a PR to fix this more generally, but thought it wise to first get some input on how the maintainers would like to handle this (assuming you agree with my diagnosis). The "best" ways I can think of to fix this would be:

Make the Pause object aware of the client that the pause is associated with (so we don't need to iterate through the consumers in ConsumerSet.
Handle resetting pause state (i.e. unpausing the rdkafka consumer) in rebalance callbacks.

What do you all think? Thanks in advance for considering this.

Answer 1 · 2023-06-02T20:34:21.000Z

We've seen the same thing in production recently. Thanks for the fork fix!