Circuit Breaker to stop processing records if HTTP endpoint errors cross a threshold
masaldaan opened this issue · 9 comments
This is more of a question/suggestion in the vein of #82.
How can I stop/pause the processing/polling of records if the downstream system (e.g. the Slow GPS tracking system in the examples) starts returning 5xx errors?
I could add retries as suggested in #82 (which I really liked, kudos), but there might be a threshold, say 50 5xx errors in 5 secs, at which point, you'd like to pause processing any records.
You could then decide to restart after a pre-decided cooldown period, or have a switch that needs to be flipped manually after which you restart.
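To make the threshold idea concrete, here is a minimal sketch of a sliding-window breaker in plain Java. It is not part of parallel-consumer: the class name `SlidingWindowBreaker`, its constructor parameters, and the injectable clock are all assumptions made for illustration. It opens once `threshold` failures land inside `windowMillis`, then either closes again after `cooldownMillis` or, if the cooldown is negative, stays open until `reset()` is called manually.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical helper, not a parallel-consumer API. */
final class SlidingWindowBreaker {
    private final int threshold;         // e.g. 50 failures
    private final long windowMillis;     // e.g. 5_000 ms
    private final long cooldownMillis;   // negative = manual reset only
    private final LongSupplier clock;    // injectable so the logic can be tested
    private final Deque<Long> failures = new ArrayDeque<>();
    private long openedAt = -1;          // -1 means the breaker is closed

    SlidingWindowBreaker(int threshold, long windowMillis,
                         long cooldownMillis, LongSupplier clock) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    /** Call this whenever the downstream system returns a 5xx. */
    synchronized void recordFailure() {
        long now = clock.getAsLong();
        failures.addLast(now);
        // Drop failures that have fallen out of the window.
        long cutoff = now - windowMillis;
        while (!failures.isEmpty() && failures.peekFirst() < cutoff) {
            failures.removeFirst();
        }
        if (openedAt < 0 && failures.size() >= threshold) {
            openedAt = now; // trip the breaker
        }
    }

    /** True if records may be processed right now. */
    synchronized boolean allowsWork() {
        if (openedAt < 0) {
            return true;
        }
        if (cooldownMillis >= 0 && clock.getAsLong() - openedAt >= cooldownMillis) {
            reset(); // cooldown elapsed, close the breaker again
            return true;
        }
        return false;
    }

    /** The manual switch: close the breaker and forget past failures. */
    synchronized void reset() {
        failures.clear();
        openedAt = -1;
    }
}
```

For the numbers in this post, that would be something like `new SlidingWindowBreaker(50, 5_000, 30_000, System::currentTimeMillis)`, where the 30 second cooldown is an arbitrary choice. The methods are synchronized so that parallel workers can share one instance.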
I could implement the circuit breaking myself, but I'm having trouble visualising how the pause/resume might work for parallel consumers.
I'd appreciate any pointers/help.
Many thanks!
Any records from a partition or topic? Or any records of a certain key or field?
We could add a stage to the system, at message ingestion and again at work retrieval, that conditionally schedules the work up front rather than waiting for the failure. When the work is actually scheduled, it could check a switch again, which could be derived from the target system.
That is a good question, I was thinking of any records from a partition for a particular consumer group (since in the scenario I was envisioning, a consumer group services a single downstream system)
Conditional scheduling is actually how I'm handling it in vanilla & Spring-based Kafka consumers, but I am still reading up on offset management in parallel-consumers, so I did not want to offer up half-baked solutions.
Ah ok, pausing everything is a bit different: much easier, and it would be more efficient. I've got a couple of ideas for both. I'll push up a draft interface in a couple of days; let me know what you think.
For everything, the controller basically just needs to stop taking work. The broker poller will pause things automatically, and resume them once the controller starts taking work again.
Hi @astubbs, I'll gladly try & help out with this issue. It will take me some time to actually get up to speed with the internals though, I hope that's alright.
FYI, the easiest way to do this is to wrap your user functions in a function which checks the return result of the user function, or tests something, to see what the target host name is. Check if the hostname is in a map of disabled hosts, and if so, fail the function immediately (throw any exception). This will cause the message to go back into the queue and eventually be retried (you can plug in a custom retry delay calculator here too). The effect is that the messages will just be retried forever, but you can skip message processing immediately and fail fast. No changes to the framework are required. Let me know your thoughts.
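A rough sketch of that wrapper, in plain Java, might look like the following. The names `FailFastWrapper`, `disabledHosts`, and `hostOf` are made up for illustration and are not library API; how you derive the host name from a record, and what decides when a host gets added to or removed from the set, is up to you.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical helper, not a parallel-consumer API. */
final class FailFastWrapper {
    // Hosts whose circuit is currently open. A concurrent set, because the
    // wrapped function may run on many worker threads at once.
    static final Set<String> disabledHosts = ConcurrentHashMap.newKeySet();

    /**
     * Wraps a user function so that it fails immediately when the record's
     * target host is disabled, instead of calling the slow or failing system.
     */
    static <T> Consumer<T> wrap(Function<T, String> hostOf, Consumer<T> userFunction) {
        return record -> {
            String host = hostOf.apply(record);
            if (disabledHosts.contains(host)) {
                // Any exception will do: the record is treated as failed
                // and is retried later.
                throw new IllegalStateException("Circuit open for host " + host);
            }
            userFunction.accept(record);
        };
    }
}
```

The wrapped function is then what you hand to the consumer in place of the original user function. Flipping the switch is just `FailFastWrapper.disabledHosts.add(host)` or `remove(host)`, which could be done manually or by whatever code is counting the 5xx responses.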
FYI here's documentation for how you'd do this, with an example: https://github.com/confluentinc/parallel-consumer/tree/master#circuit-breaker-pattern