martomi/chiadog

[FR] Notification backoff - avoid flooding notifications during prolonged outage


TL;DR: a prolonged harvester outage of, say, 8 hours during sleep will produce a notification every 5 minutes per harvester. This is annoying to wake up to and may also consume limited resources/credits of the notification service. Instead, an exponential backoff should be used, which can easily cut consumption by 90% while still conveying much the same level of urgency to an operator.

Steps to produce suboptimal behavior:

  1. Full Node craps itself out at 01:00. The operator is sound asleep and does not react to the notifications received from harvesters: "Your harvester appears to be offline! No events for the past 400 seconds."
  2. Harvesters continue not being able to connect and continue sending the notification every 5 minutes. With 2 remote harvesters sending a notification every 5 minutes during sleep time this equals a total of 192 alerts sent over 8 hours of sleep. This is not a hypothetical but exactly what happened to me. :-)

Better behavior:

  1. Full Node craps itself out at 01:00. Operator is still blissfully sleeping and ignoring notifications.
  2. Every harvester alerts: "Your harvester appears to be offline! No events for the past 400 seconds."
  3. Instead of a fixed 5-minute alert threshold, use an exponential backoff algorithm. Values can be played with at https://exponentialbackoffcalculator.com/ ; I'll use the following as an example here:
    • Interval: 300 seconds
    • Max retries: 11 (just for illustrative purposes, no need to implement a max retry)
    • Exponential: 1.5
  4. With the values above, notifications would be triggered at the following times after the start of the incident (cumulative elapsed time, not the intervals between consecutive notifications):
    • 5m
    • 12.5m
    • 23.75m
    • 40m
    • 66m
    • 104m
    • 2.7h
    • 4.1h
    • 6.2h
    • 9.4h
  5. As an end result, our operator would wake up to roughly 20 notifications from 2 harvesters instead of 192. It is still clear that the issue is long-term and ongoing, yet only about 10% of the resources of the naive approach have been consumed, and at the start of the incident notifications still arrive fairly often (see the sketch after this list).
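For illustration, here is a small Python sketch (not chiadog code; the values are just the illustrative ones above) that reproduces this schedule and the fixed-vs-backoff comparison:

```python
# Quick sketch using the illustrative values from this issue (not chiadog code).
INTERVAL = 300          # seconds, the basic keep-alive alert interval
RATE = 1.5              # exponential factor
OUTAGE = 8 * 3600       # 8 hours of sleep
HARVESTERS = 2

# Naive behavior: one alert per harvester every 5 minutes.
fixed_alerts = HARVESTERS * (OUTAGE // INTERVAL)

# Backoff: the n-th wait is INTERVAL * RATE**n, so the time of each
# notification since the incident start is the running sum of those waits.
backoff_alerts, elapsed, n = 0, 0.0, 0
while elapsed + INTERVAL * RATE ** n <= OUTAGE:
    elapsed += INTERVAL * RATE ** n
    print(f"notification {n + 1} at {elapsed / 60:.1f} min after incident start")
    backoff_alerts += 1
    n += 1

print(f"fixed: {fixed_alerts} alerts, backoff: {backoff_alerts * HARVESTERS}")
# Within exactly 8 h this gives 18 backoff alerts vs. 192 fixed ones; counting
# the next notification per harvester (at ~9.4 h) yields the ~20 quoted above.
```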

Solution summary:

  • Decide on an exponential rate that makes sense; my example above uses 1.5, which produces ~90% savings in the example scenario with very little noticeable effect for most alert scenarios.
  • Calculate thresholds for notifications with the following formula: T(retryNumber) = interval * exponential_rate ^ retryNumber (a small sketch follows).
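A minimal function form of that formula (the name is illustrative, not an existing chiadog identifier) could look like this; the times in the scenario above are the running sums of these per-retry waits:

```python
def notification_threshold(retry_number: int,
                           interval: float = 300,
                           exponential_rate: float = 1.5) -> float:
    """Seconds to wait after the previous notification before sending the next."""
    return interval * exponential_rate ** retry_number

# First few waits: 300 s (5 m), 450 s (7.5 m), 675 s (11.25 m), 1012.5 s (~17 m)
print([notification_threshold(n) for n in range(4)])
```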

Good feature design 👍 I'd be happy to upstream this feature if someone has worked or plans to work on it!

Looking at the code responsible for this, it seems to me the keep-alive threshold should still define the check interval, but then the modification would be to (see the sketch after the list below):

  1. In keep_alive_monitor, keep and increment an event counter while the threshold is met, and reset it once it is no longer met.
  2. Calculate the notification thresholds from the counter using the formula above.
  3. Not trigger event creation unless a sufficient delta has passed, OR pass in a notify=False flag which would cause the event to be logged but not sent to other notification channels. (This could be a useful feature for other things as well. It might also be less confusing for someone not aware of the backoff if the logs say "offline for 50000 seconds but not notifying until Y seconds to avoid flooding you.")
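To make that plan concrete, here is a hedged sketch of the backoff state machine. The class name, method names, and how keep_alive_monitor would call it are assumptions rather than actual chiadog code:

```python
import logging
import time
from typing import Optional


class KeepAliveBackoff:
    """Tracks how often the 'harvester appears offline' notification may fire."""

    def __init__(self, interval: float = 300, rate: float = 1.5):
        self.interval = interval
        self.rate = rate
        self.retry_count = 0                        # step 1: counter, reset when healthy
        self.last_notified: Optional[float] = None  # time of the last sent notification

    def should_notify(self, seconds_since_last_event: float,
                      now: Optional[float] = None) -> bool:
        """Called on every keep-alive check; True means send the notification."""
        now = time.time() if now is None else now
        if seconds_since_last_event < self.interval:
            # Harvester is healthy again: reset the backoff state.
            self.retry_count = 0
            self.last_notified = None
            return False
        if self.last_notified is None:
            # First alert of the incident goes out immediately.
            self.retry_count, self.last_notified = 1, now
            return True
        # Step 2: current threshold from T(n) = interval * rate ** n.
        threshold = self.interval * self.rate ** self.retry_count
        if now - self.last_notified >= threshold:
            self.retry_count += 1
            self.last_notified = now
            return True
        # Step 3: log the event but suppress the outgoing notification
        # (roughly the notify=False idea from the list above).
        logging.info(
            "Harvester offline for %.0f s; next notification in %.0f s "
            "(suppressed to avoid flooding you).",
            seconds_since_last_event,
            threshold - (now - self.last_notified),
        )
        return False
```

With something like this, keep_alive_monitor keeps its existing check interval and simply consults should_notify() before creating the outgoing event.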

Doesn't seem too difficult to implement. If the above plan sounds OK, I should be able to eventually knock something out. My immediate issue is gone, though, thanks to automatic service restarts based on monitoring TCP 8444. :-)

Yes, sounds good. I like the idea of still logging all the events, with information on why the notification was withheld. I think it's OK to handle this entirely in the keep_alive_monitor component, because the notify_manager doesn't have any function besides notification at the moment (in regard to passing notify=False, if I understood that correctly).

That is, unless we want more general notification throttling that works across all event types. But as I understand it, this is currently the most offensive notification type.

Alright, so I'll try to implement a really targeted fix for the keep-alive, but keep in mind that it could be refactored later into a more generic solution.