zendesk/racecar

producer never recovers after network split

breunigs opened this issue · 2 comments

I have no good steps to reproduce, especially since this requires breaking the network (or killing the broker) in the right way and is potentially timing-related. I'm hazy on how to get into this state, but I'm fairly certain about the effects it causes in racecar. I'll describe the variant I investigated; more production logs are available on demand.

Roughly:

  1. racecar is up and operating fine
  2. introduce a network split to lose connection to all brokers. Everything will be considered down from librdkafka's point of view.
  3. eventually restore connection again
  4. racecar will clear stuck rdkafka consumers through ConsumerSet#reset_current_consumer in certain scenarios. This was enough to eventually receive messages again.
  5. publish a message (in reaction to the just-received ones). Adding it to librdkafka's queue will work. However, it will fail in Consumer#deliver! since there is no broker connection, so rdkafka eventually raises a WaitTimeoutError. Note that the method exits before calling @delivery_handles.clear. This is fine; a handle can be waited upon multiple times.
  6. The error bubbles into the pause handling, pausing this topic-partition for some time.
  7. eventually we will receive another message; it doesn't matter which partition it comes from. Eventually racecar will call @delivery_handles.each(&:wait) again, which waits on the handle for the message described in step 5.
  8. after enough retries, and once message.timeout.ms is exceeded, librdkafka will give up and report msg_timed_out. The message will not be sent. (A sketch of this handle lifecycle follows the list.)
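
To make steps 5-8 concrete, here is a minimal sketch of that handle lifecycle using rdkafka-ruby's producer API; it is not racecar's actual source, and the broker address and topic are placeholders.

```ruby
require "rdkafka"

producer = Rdkafka::Config.new("bootstrap.servers" => "localhost:9092").producer
delivery_handles = []

# Step 5: enqueueing into librdkafka's local queue succeeds even without a
# broker connection; the failure only surfaces when waiting on the handle.
delivery_handles << producer.produce(topic: "events", payload: "hello")

begin
  delivery_handles.each { |handle| handle.wait(max_wait_timeout: 5) }
  delivery_handles.clear # only reached once every handle has completed
rescue Rdkafka::AbstractHandle::WaitTimeoutError
  # Steps 5-7: the wait times out, the stale handle stays in the array, and
  # every later delivery attempt blocks on it again.
rescue Rdkafka::RdkafkaError => e
  # Step 8: once message.timeout.ms is exceeded the handle completes with
  # :msg_timed_out; librdkafka has given up and the message will not be sent.
  puts "finally failed: #{e.code}"
end
```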

At the very least, racecar should remove these finally failed messages/handles from @delivery_handles. Even if the producer eventually recovers, these failed messages will result in head-of-line blocking. Of course, as it does today, racecar should still fail on them so that the single message or the whole batch gets retried.
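
A hedged sketch of that cleanup, assuming an array like racecar's @delivery_handles and rdkafka-ruby's handle API (pending? returns false once a handle has completed, successfully or not):

```ruby
def flush_delivery_handles(delivery_handles)
  delivery_handles.each { |handle| handle.wait(max_wait_timeout: 5) }
  delivery_handles.clear
rescue Rdkafka::RdkafkaError => e
  if e.code == :msg_timed_out
    # These messages have finally failed; librdkafka will not retry them, so
    # keeping their handles around only causes head-of-line blocking.
    delivery_handles.reject! { |handle| !handle.pending? }
  end
  raise # racecar should still fail, so the message or the batch gets retried
end
```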

That being said, this might not be enough. From the statistics librdkafka makes available, I know that all producer broker connections were in state "INIT". I still have to check whether this is the default (with librdkafka only connecting on first use), or whether it indicates librdkafka was additionally stuck for another reason.
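
For reference, the broker states can be observed through rdkafka-ruby's statistics callback, assuming librdkafka's documented statistics JSON layout; "statistics.interval.ms" has to be set on the config for the callback to fire:

```ruby
Rdkafka::Config.statistics_callback = lambda do |stats|
  (stats["brokers"] || {}).each do |name, broker|
    # Healthy connections report "UP"; after the split all of the producer's
    # broker connections stayed in "INIT".
    puts "#{name}: #{broker["state"]}"
  end
end
```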

(I intend to provide a PR on this, but I need to finish the investigation first)

nope, even for a consumer-only process the producer will be in state "UP" immediately after boot. So I'd say the producer was additionally stuck. I propose the following:

  1. if we run into unrecoverable producer errors like msg_timed_out, we reset the producer in a similar fashion to what we already do for the consumers: close it and recreate it.
  2. we clear @delivery_handles whenever the producer is (re)set (rough sketch below).
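
A rough sketch of what that could look like, assuming the producer is built from an Rdkafka::Config; the method and instance variable names here are hypothetical, not racecar's actual API:

```ruby
def reset_producer!
  @producer.close if @producer          # drop the stuck client entirely
  @producer = @producer_config.producer # recreate from the same config
  @delivery_handles.clear               # point 2: stale handles must not survive the reset
end

# In the error handling path, only trigger the reset for unrecoverable errors:
#
#   rescue Rdkafka::RdkafkaError => e
#     reset_producer! if e.code == :msg_timed_out
#     raise
#   end
```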

I'll see how to get that in without breaking API changes, since the former is handled in runner.rb, but the latter lives in consumer.rb. The equivalent handling for consumers lives in consumer_set.rb, so this will be some fun.