How should the Hyperbahn network recover if a single worker is Busy?

Question

Raynos opened this issue 9 years ago · 6 comments

Currently, if a single worker is TotalBusy due to co-tenancy issues, it will return Busy frames back to the edge.

It's expected that the edge should retry somewhere else.

However, by retrying on Busy we open ourselves up to cascading failures, and we have not implemented work shedding yet. We still want to retry, because a single worker failing should be invisible to edge users.

One solution to this problem involves a few pieces:

- Take into account the number of busy frames when doing peer selection. This will make any given node favor peers that are not busy.
- Continue retrying worker busy errors elsewhere.
- On any given sub channel, if the majority of its peers are busy, start shedding work by passing the busy errors back to the client / edge. At this point we've run out of capacity and the edge will have to give for the network to recover.
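
The three pieces above could be sketched roughly as follows. This is a minimal illustration and not Hyperbahn's implementation: the peer record shape, the `busyCount` field, and the function names `choosePeer` and `shouldShed` are all hypothetical, and "majority" is assumed to mean strictly more than half.

```javascript
// Hypothetical peer records: { host, busyCount }, where busyCount is the
// number of Busy frames recently seen from that peer. How "recently" is
// measured is left open in this sketch.

// Piece 1: favor peers that have returned fewer Busy frames.
// `exclude` lists hosts already tried for this request, which gives
// piece 2 (retry busy errors elsewhere).
function choosePeer(peers, exclude) {
    var best = null;
    for (var i = 0; i < peers.length; i++) {
        var peer = peers[i];
        if (exclude.indexOf(peer.host) !== -1) {
            continue;
        }
        if (best === null || peer.busyCount < best.busyCount) {
            best = peer;
        }
    }
    return best; // null when every peer has already been tried
}

// Piece 3: shed work (pass Busy back to the client / edge) once a strict
// majority of the sub channel's peers are busy.
function shouldShed(peers) {
    var busy = peers.filter(function isBusy(peer) {
        return peer.busyCount > 0;
    }).length;
    return busy * 2 > peers.length;
}

var examplePeers = [
    { host: 'a:1', busyCount: 3 },
    { host: 'b:1', busyCount: 0 },
    { host: 'c:1', busyCount: 1 }
];
```

With `examplePeers`, `choosePeer(examplePeers, [])` picks `b:1`, and a retry that excludes `b:1` falls through to `c:1`. Because two of the three peers have seen Busy frames, `shouldShed(examplePeers)` reports that the sub channel is out of capacity.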

Answer 1 · 2015-09-12T15:31:19.000Z

Idea: for a subchannel, if there are any other peers that are "not recently
busy", transform busy frames into retriable error frames (declined) at both
ingress and egress. Only transport a busy frame if the entire downstream
cluster is busy.
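
One possible reading of this rule, as a sketch only: the peer record shape, the `lastBusyAt` field, the `recentWindowMs` parameter, and the `frameToSend` helper are hypothetical names chosen for illustration, and "not recently busy" is assumed to mean "no Busy frame within a fixed window".

```javascript
// Decide which error frame to forward for a subchannel after one peer
// answered Busy. Peer records are { host, lastBusyAt } with lastBusyAt in
// milliseconds, or null if the peer has never been busy (an assumption).
function frameToSend(peers, busyHost, now, recentWindowMs) {
    var hasAlternative = peers.some(function notRecentlyBusy(peer) {
        if (peer.host === busyHost) {
            return false;
        }
        var recentlyBusy = peer.lastBusyAt !== null &&
            now - peer.lastBusyAt < recentWindowMs;
        return !recentlyBusy;
    });
    // Declined asks the caller to retry elsewhere. Busy is only passed along
    // when every other peer has also been busy recently.
    return hasAlternative ? 'Declined' : 'Busy';
}
```

The same function would run at both ingress and egress, each against its own view of the subchannel's peers.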

This involves tracking a time decaying busy score on each peer for both
load balancing and saturation detection.
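
A time-decaying busy score could look something like the sketch below. It assumes exponential decay with a configurable half-life; the `BusyScore` name, its methods, and the `halfLifeMs` parameter are hypothetical and not taken from Hyperbahn or TChannel.

```javascript
// Exponentially decaying busy score for one peer. halfLifeMs is an assumed
// tuning knob: after that many milliseconds without a Busy frame, the score
// has halved. Timestamps are assumed to be non-decreasing.
function BusyScore(halfLifeMs) {
    this.halfLifeMs = halfLifeMs;
    this.score = 0;
    this.updatedAt = 0;
}

// Score at time `now` (ms), after applying decay since the last update.
BusyScore.prototype.valueAt = function valueAt(now) {
    var elapsed = now - this.updatedAt;
    return this.score * Math.pow(0.5, elapsed / this.halfLifeMs);
};

// Record one Busy frame observed at time `now` (ms).
BusyScore.prototype.recordBusy = function recordBusy(now) {
    this.score = this.valueAt(now) + 1;
    this.updatedAt = now;
};
```

The same number can serve both purposes named above: load balancing prefers the peer with the lowest score, and saturation detection checks whether every peer's score is above some threshold.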

Busy is still a signal for exponential backoff. We should still pursue a
change in latency signal for flow control, which should help us avoid busy
frames in most cases.
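
For reference, exponential backoff on Busy could be as simple as the following sketch. The `backoffDelay` helper and its `baseMs` and `maxMs` parameters are hypothetical, and jitter is left out for brevity.

```javascript
// Delay in milliseconds before the next attempt, after `busyCount`
// consecutive Busy responses (busyCount >= 1). The delay doubles each time
// and is capped at maxMs.
function backoffDelay(busyCount, baseMs, maxMs) {
    var delay = baseMs * Math.pow(2, busyCount - 1);
    return Math.min(delay, maxMs);
}
```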

Answer 2 · 2015-09-12T21:36:42.000Z

@kriskowal That does not help if the ingress is rate limited. Clients still see Busy, and an ingress does not know what the health is.

This needs to be applied to the Hyperbahn client itself. However, transforming Busy into Declined is going to be confusing from a metrics point of view.

Maybe the total-rate rate limiter should return Unhealthy.

Answer 3 · 2015-09-14T18:09:11.000Z

https://github.com/uber/tchannel/blob/master/node/errors.js#L775

Busy is retriable, same as declined. Unhealthy is not.
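
Expressed as code, the classification described in this comment is roughly the following. The `RETRIABLE` table and `shouldRetry` helper are illustrative only and are not TChannel's API; treating unlisted error types as non-retriable is an assumption of this sketch.

```javascript
// Retry classification for the three error types discussed in this thread.
var RETRIABLE = {
    Busy: true,
    Declined: true,
    Unhealthy: false
};

// Returns true only for error types explicitly marked retriable above.
function shouldRetry(errorType) {
    return RETRIABLE[errorType] === true;
}
```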

Answer 4 · 2015-09-14T19:44:38.000Z

We should probably retry on Unhealthy

Answer 5 · 2015-09-14T19:44:42.000Z

cc @kriskowal ^

Answer 6 · 2015-09-15T19:10:09.000Z

Brace yourself for weirdness.

The circuit breaker sends Declined errors when it is Unhealthy. Declined can be retried.

I added the Unhealthy error type to the protocol in anticipation of retry-on-unhealthy having bad consequences. The circuit breaker doesn’t use it right now. We could remove it and pretend it never happened.