openedx/event-bus-kafka

[Producer] Ensure that delivery callbacks are processed in timely manner (requires some discovery)

timmc-edx opened this issue · 3 comments

We should ensure that polling is happening at a consistent enough rate on the producer.

Currently, we call poll(0) every time we send an event, which means we will eventually call it for almost every ack that comes back. But there are two issues we suspect with this method:

  • If events are sent at a slow or variable rate, acks may take a long time to be processed as callbacks/stats/???
  • The last event sent before server shutdown is guaranteed not to have its ack processed

The other recommended method for handling producer polling is to call poll(0) every N seconds in a separate thread. This would decouple acks/error reporting reliability from signal production rate and ensure that the last events have their acks processed (unless server shutdown happens too soon after; see #11 for details there.)

A/C:

  • Discovery:
    • Understand the implications of:
      • Failing to process some callbacks (at end of server lifecycle)
      • Delayed processing (due to infrequent events)
      • Just not calling poll at all
        • Answer: Messages are kept in the queue until their delivery report callback is run, so failing to call poll at all will eventually produce errors:

          To avoid ERR__QUEUE_FULL errors, make sure to call poll regularily.

    • Pick a method, or decide we don't care about acks and delivery callbacks
  • Implementation:
    • Implement the decision, along with an ADR

References:

Rather than an ADR, I ended up with a docstring on the polling loop: https://github.com/openedx/event-bus-kafka/pull/39/files#diff-3134bce95bcf050a2ac1747f3bceb1acad0b4f6ac393bea3d2bf9b133241dcdeR228

Not sure it's worth a full ADR since it's a pretty localized decision.

(But we also didn't establish exactly what the consequences would be of not having this polling loop, and decided it would be easier to include than to decide whether to include it, which might be ADR-worthy...)

Completed work with:

  • #39 -- polling loop
  • #41 -- ADR for the above