failsafe-lib/failsafe

Support accrual failure detection

jhalterman opened this issue · 4 comments

As Failsafe already supports policies that are useful for networked operations, it would make sense to support phi accrual (or other accural algorithms) failure detection for situations where fixed timeouts don't adequately account for changing load conditions.

This could be implemented as a new policy which measures execution times over a number of executions to determine whether some threshold that represents a failure has been crossed. Phi accrual could be one strategy supported by the policy, but there could be others. When the threshold is crossed, a fallback-like function could be called, for example, to fail over from a node that has failed to another one. In that sense, the policy would be like a time-based fallback (rather than a result-based one), except that unlike a fallback it would be stateful.
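
To make the idea concrete, here is a rough sketch of how a phi value could be computed from a sliding window of execution times. Nothing here is Failsafe API: `PhiAccrualEstimator`, its constructor parameters, and the millisecond units are all hypothetical, and the sketch assumes execution times are roughly normally distributed, using a logistic approximation of the normal CDF.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch, not part of Failsafe: phi computed over recent execution times. */
final class PhiAccrualEstimator {
    private final int windowSize;          // assumed: number of recent samples to keep
    private final double minStdDevMillis;  // assumed: floor so phi stays finite for uniform samples
    private final Deque<Double> samples = new ArrayDeque<>();
    private double sum;
    private double sumOfSquares;

    PhiAccrualEstimator(int windowSize, double minStdDevMillis) {
        if (windowSize < 1 || minStdDevMillis <= 0) {
            throw new IllegalArgumentException("windowSize must be >= 1 and minStdDevMillis > 0");
        }
        this.windowSize = windowSize;
        this.minStdDevMillis = minStdDevMillis;
    }

    /** Records the duration of a completed execution, evicting the oldest sample if full. */
    synchronized void record(double durationMillis) {
        samples.addLast(durationMillis);
        sum += durationMillis;
        sumOfSquares += durationMillis * durationMillis;
        if (samples.size() > windowSize) {
            double evicted = samples.removeFirst();
            sum -= evicted;
            sumOfSquares -= evicted * evicted;
        }
    }

    /**
     * Returns phi = -log10(P(duration > elapsedMillis)) under the recorded distribution.
     * Larger values mean the elapsed time is less likely to belong to a healthy execution.
     */
    synchronized double phi(double elapsedMillis) {
        if (samples.isEmpty()) {
            return 0.0; // no history yet, so no suspicion
        }
        double mean = sum / samples.size();
        double variance = Math.max(0.0, sumOfSquares / samples.size() - mean * mean);
        double stdDev = Math.max(Math.sqrt(variance), minStdDevMillis);
        double y = (elapsedMillis - mean) / stdDev;
        // Logistic approximation of the normal distribution's tail probability.
        double e = Math.exp(-y * (1.5976 + 0.070566 * y * y));
        // Both branches are algebraically equal; each is the numerically safer form on its side.
        double pLater = elapsedMillis > mean ? e / (1.0 + e) : 1.0 - 1.0 / (1.0 + e);
        return -Math.log10(pLater);
    }
}
```

A policy built on something like this would compare `phi(elapsed)` against a configurable threshold and invoke the fallback-like function when the threshold is exceeded. The threshold value and what counts as `elapsed` (a completed execution or one still in flight) are open design choices.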

Alternatively, this could be implemented as a Timeout option, where the timeout is stateful and adapts to execution time distributions.
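
As one possible shape for the adaptive-timeout variant, the sketch below derives a timeout from smoothed execution times. It borrows the estimator TCP uses for retransmission timeouts (RFC 6298: smoothed value plus four times the smoothed deviation) rather than phi accrual, and the class name, the clamping bounds, and the choice to return the maximum before any samples exist are all assumptions, not Failsafe API.

```java
import java.time.Duration;

/** Hypothetical sketch, not part of Failsafe: a timeout that tracks observed execution times. */
final class AdaptiveTimeoutEstimator {
    private static final double ALPHA = 0.125; // gain for the smoothed mean (RFC 6298)
    private static final double BETA = 0.25;   // gain for the smoothed deviation (RFC 6298)

    private final Duration min;
    private final Duration max;
    private double smoothedMillis;
    private double deviationMillis;
    private boolean initialized;

    AdaptiveTimeoutEstimator(Duration min, Duration max) {
        if (min.compareTo(max) > 0) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.min = min;
        this.max = max;
    }

    /** Folds the duration of a completed execution into the running estimates. */
    synchronized void record(Duration executionTime) {
        double sample = executionTime.toNanos() / 1_000_000.0;
        if (!initialized) {
            smoothedMillis = sample;
            deviationMillis = sample / 2.0;
            initialized = true;
        } else {
            // The deviation is updated first, using the previous smoothed value.
            deviationMillis = (1 - BETA) * deviationMillis + BETA * Math.abs(smoothedMillis - sample);
            smoothedMillis = (1 - ALPHA) * smoothedMillis + ALPHA * sample;
        }
    }

    /** Returns smoothed + 4 * deviation, clamped to [min, max]; max if nothing was recorded. */
    synchronized Duration currentTimeout() {
        if (!initialized) {
            return max;
        }
        long millis = Math.round(smoothedMillis + 4 * deviationMillis);
        long clamped = Math.max(min.toMillis(), Math.min(max.toMillis(), millis));
        return Duration.ofMillis(clamped);
    }
}
```

With this shape, a stateful timeout would call `currentTimeout()` before each execution and `record(...)` after each completion, so the limit tightens under steady load and loosens when execution times become more variable.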

One open question for this policy, as with a circuit breaker or rate limiter, is at what point it should "reset" after triggering a failure, or whether it should reset at all.

Any ideas for how this should work or what the policy should be named are welcome!

accural -> accrual

For some reason my fingers always struggle with that one :)

This is definitely a sign that the new policy should not be named PhiAccrual :) I like the idea of thinking about a new policy more generally, as something that measures a series of execution times, where phi accrual is maybe just one strategy for determining if those times represent a failure.
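
A rough sketch of that more general shape could look like the following, with the failure test pulled out behind a strategy interface. All names here (`ExecutionTimeStrategy`, `MeanMultipleStrategy`, `ExecutionTimeMonitor`) are made up for illustration and are not Failsafe API; the mean-multiple strategy is just a deliberately simple stand-in for something like phi accrual, and skipping failed samples when updating the history is an assumption rather than a settled choice.

```java
import java.time.Duration;
import java.util.function.Consumer;

/** Hypothetical: decides whether an execution time represents a failure. */
interface ExecutionTimeStrategy {
    void record(Duration executionTime);
    boolean isFailure(Duration elapsed);
}

/** Hypothetical stand-in strategy: failure if elapsed exceeds a multiple of the running mean. */
final class MeanMultipleStrategy implements ExecutionTimeStrategy {
    private final double multiple;
    private long count;
    private double meanNanos;

    MeanMultipleStrategy(double multiple) {
        this.multiple = multiple;
    }

    @Override
    public synchronized void record(Duration executionTime) {
        count++;
        // Incremental mean update avoids keeping every sample.
        meanNanos += (executionTime.toNanos() - meanNanos) / count;
    }

    @Override
    public synchronized boolean isFailure(Duration elapsed) {
        return count > 0 && elapsed.toNanos() > multiple * meanNanos;
    }
}

/** Hypothetical: feeds execution times to a strategy and calls a handler on failure. */
final class ExecutionTimeMonitor {
    private final ExecutionTimeStrategy strategy;
    private final Consumer<Duration> onFailure;

    ExecutionTimeMonitor(ExecutionTimeStrategy strategy, Consumer<Duration> onFailure) {
        this.strategy = strategy;
        this.onFailure = onFailure;
    }

    /** Reports a completed execution; failing samples are not added to the history. */
    void onCompleted(Duration executionTime) {
        if (strategy.isFailure(executionTime)) {
            onFailure.accept(executionTime);
        } else {
            strategy.record(executionTime);
        }
    }
}
```

Under this split, the policy name only has to describe "watching execution times", and phi accrual becomes one `ExecutionTimeStrategy` among several.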