openedx/event-bus-kafka

[Consumer] Catch and log exceptions in the consumer loop

timmc-edx opened this issue · 0 comments

Currently (v1.2.0), any exception raised in the consumer loop kills the loop. We probably don't want that, as not all deployments will necessarily restart the management command on failure. (K8s might also let it die permanently if it crashes repeatedly.)
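A rough sketch of the kind of top-level guard this is asking for, not the actual consumer code. `run_consumer_loop`, `handle_message`, `max_iterations`, and `FlakyConsumer` are hypothetical names made up for illustration; the only thing assumed about the real consumer is that it exposes a `poll(timeout=...)` call that returns a message or `None`.

```python
import logging

logger = logging.getLogger(__name__)


def run_consumer_loop(consumer, handle_message, max_iterations=None):
    """Poll repeatedly; log and continue on any exception instead of exiting.

    `max_iterations` is a hypothetical knob so this sketch can terminate;
    a real loop would presumably run until it is shut down.
    Returns the number of exceptions that were caught.
    """
    iterations = 0
    errors = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue  # nothing arrived within the timeout
            handle_message(msg)
        except Exception:
            # Catch-all at the highest level: record the failure, keep looping.
            errors += 1
            logger.exception("Error consuming event; continuing")
    return errors


class FlakyConsumer:
    """Stand-in consumer whose poll() fails on the second call."""

    def __init__(self, messages):
        self._messages = list(messages)
        self._calls = 0

    def poll(self, timeout=None):
        self._calls += 1
        if self._calls == 2:
            raise RuntimeError("simulated poll failure")
        return self._messages.pop(0) if self._messages else None


handled = []
error_count = run_consumer_loop(
    FlakyConsumer(["a", "b"]), handled.append, max_iterations=4
)
```

Here the simulated poll failure is logged and counted, and the second message is still handled afterwards. Catching `Exception` rather than `BaseException` leaves `KeyboardInterrupt` and `SystemExit` alone, so an operator can still stop the command.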

A/C:

  • try-block at highest level of consumer loop prevents any exception from killing the loop
  • [Note: this only applies if https://github.com//issues/72 is done and we are doing manual commits] Verify consumer offset commits are still sent even when an exception is caught
  • When a message is consumed but cannot be processed or contains an error (i.e. msg.error() is not None), log the offset, full topic name, key, partition, consumer group, and any other necessary information at the error level so that later replaying would be possible.
    • possibly continue to ignore EOF_PARTITION errors, but first look at what that error actually is
    • consider logging error properties if they may be helpful for debugging
  • Log the identity of any failing receivers (and the key, topic, etc.) if we get as far as sending to the signal but one or more receivers raise an exception. (See return value of send_event.)
  • Update PII-handling ADR to mention that we're also logging PII but that we don't think it's a big concern, because logs are already handled as sensitive information and it would only happen on error.
  • Call edx-django-utils record_exception (or whatever it gets renamed to) when catching any exception (not including failing receivers)
  • Update ADR about this decision to catch and log all exceptions. (Also see producer ADR: https://github.com/openedx/event-bus-kafka/blob/dd96ab4f678653f3aa537d1fb446f2db10cfb094/docs/decisions/0008-baseline-error-handling-for-producer.rst) Possible information to include or consider:
    • We may use DLQs later
    • We may at some point have enough experience to handle some exceptions differently from others
    • There are a bunch of different possible kinds of errors:
      • Kafka errors
        • Poll call fails
        • Poll returns a Message but it has a Kafka error code rather than a value
      • Envelope/encoding issues
        • Header that doesn't match the expected signal type (could be either producer or consumer misconfiguration)
        • Other envelope errors (random bugs like duplicated headers or something)
        • Failure to talk to schema registry
        • Deserialization failures (schema mismatches, library version mismatches, serialization bugs in producer...)
      • Signal receiver errors
  • Update https://openedx.atlassian.net/wiki/spaces/AC/pages/3508699151/How+to+start+using+the+Event+Bus#Error-handling with current information about logging (including anything from the ADR that might be useful, such as a note on DLQs)
  • Update ADR to Accepted once all items are implemented.
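For the two logging items above (message metadata and failing receivers), something along these lines might work. This is only a sketch under assumptions: it assumes the return value of send_event is a list of (receiver, response) pairs in the style of Django's `Signal.send_robust`, where a failing receiver's response is the exception instance, and that the message object has `topic()`, `partition()`, `offset()`, and `key()` accessors. `describe_message`, `failing_receivers`, `log_receiver_failures`, and `FakeMessage` are hypothetical names, not existing functions in this repo.

```python
import logging

logger = logging.getLogger(__name__)


def describe_message(msg, consumer_group):
    """Build a log-friendly description of where a message came from."""
    return (
        f"topic={msg.topic()} partition={msg.partition()} "
        f"offset={msg.offset()} key={msg.key()!r} "
        f"consumer_group={consumer_group}"
    )


def failing_receivers(send_results):
    """Return (receiver_name, exception) pairs for receivers that raised."""
    failures = []
    for receiver, response in send_results:
        if isinstance(response, Exception):
            module = getattr(receiver, "__module__", "?")
            qualname = getattr(receiver, "__qualname__", repr(receiver))
            failures.append((f"{module}.{qualname}", response))
    return failures


def log_receiver_failures(send_results, msg, consumer_group):
    """Log each failing receiver at error level with the message metadata."""
    failures = failing_receivers(send_results)
    for name, exc in failures:
        logger.error(
            "Receiver %s failed for message %s: %r",
            name, describe_message(msg, consumer_group), exc,
        )
    return failures


class FakeMessage:
    """Stand-in exposing only the accessors this sketch relies on."""

    def topic(self):
        return "dev-example-topic"

    def partition(self):
        return 0

    def offset(self):
        return 42

    def key(self):
        return b"example-key"


def ok_receiver(**kwargs):
    return "ok"


def bad_receiver(**kwargs):
    raise ValueError("boom")


# Shape assumed for the send_event return value: (receiver, response) pairs.
send_results = [(ok_receiver, "ok"), (bad_receiver, ValueError("boom"))]
failures = log_receiver_failures(send_results, FakeMessage(), "example-group")
```

Note that the key ends up in the log line, which is the PII concern the ADR item above is meant to address.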