zalando/skipper

automatic drains based on http response code

bbp-brieuc opened this issue · 2 comments

Hi, I would like backends that have significantly higher HTTP query failure rates than others to be automatically avoided by the load balancer.

This approach is better than relying on health checks alone in two ways:

  • implementing a correct health check handler is difficult both technically and organizationally: it requires first knowing every reason a backend could fail, and then getting the backend developers to add checks for all of those reasons and for any new ones introduced by future maintenance (for example, when new dependencies are added), which is very hard to achieve even in medium-sized teams
  • health checking is a backend-local decision: a backend has to declare itself healthy or not without knowing how the other backends are doing, which can result in all backends marking themselves unhealthy at the same time, or in half-broken, flaky backends not daring to mark themselves unhealthy because, for all they know, the others might be doing even worse

So it's useful to make those decisions in the load balancer, which can compare the performance of all backends, rather than in the backend itself, which has limited information, and to base them on actual query results rather than on more or less exhaustive health checks.
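To make the comparison concrete, here is a minimal sketch in Go of flagging backends whose failure rate stands out from the rest of the fleet. Everything in it (the `backendStats` type, the `outliers` function, the `margin` and `minRequests` parameters) is hypothetical and chosen for illustration; none of it is skipper code, and a real implementation would likely need time windows and a cap on how many backends can be drained at once.

```go
package main

import (
	"fmt"
	"sort"
)

// backendStats holds request counters for one backend over a recent window.
// Hypothetical type for illustration; not part of skipper.
type backendStats struct {
	requests int
	failures int // e.g. responses with a 5xx status code
}

// failureRate returns the fraction of failed requests, or 0 with no traffic.
func (s backendStats) failureRate() float64 {
	if s.requests == 0 {
		return 0
	}
	return float64(s.failures) / float64(s.requests)
}

// outliers returns the names of backends whose failure rate exceeds the
// fleet-wide failure rate by more than margin. Backends with fewer than
// minRequests requests are never flagged, because their rate is too noisy.
func outliers(stats map[string]backendStats, margin float64, minRequests int) []string {
	var totalReq, totalFail int
	for _, s := range stats {
		totalReq += s.requests
		totalFail += s.failures
	}
	if totalReq == 0 {
		return nil
	}
	fleetRate := float64(totalFail) / float64(totalReq)

	var out []string
	for name, s := range stats {
		if s.requests >= minRequests && s.failureRate() > fleetRate+margin {
			out = append(out, name)
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out
}

func main() {
	stats := map[string]backendStats{
		"a": {requests: 1000, failures: 10},
		"b": {requests: 1000, failures: 12},
		"c": {requests: 1000, failures: 400},
	}
	fmt.Println(outliers(stats, 0.1, 100)) // prints [c]
}
```

Because the threshold is relative to the fleet-wide rate, a failure that hits every backend equally (for example a shared dependency going down) flags nobody, which avoids draining all backends at the same time.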

I offer to implement this feature myself, but I would first like some initial feedback on the idea. FWIW, I've used this strategy in two different high-scale production systems, and it greatly improved reliability.

From a cursory look at the code, I think it could be done by calling an optional observer method on the LBAlgorithm after the call to RoundTrip.

szuecs commented

We are working on this kind of solution as a passive health check mechanism.
#2346

See also
#2759

A couple of PRs that help us get to this feature are already merged.

I guess we can close this issue as a duplicate, right @bbp-brieuc?

bbp-brieuc commented

Yes, that sounds similar, at least on the surface. Thanks!