uber/tchannel

How should the Hyperbahn network recover if a single worker is Busy

Raynos opened this issue · 6 comments

Currently, if a single worker is TotalBusy due to co-tenancy issues, it will return Busy frames back to the edge.

It's expected that the edge should retry somewhere else.

However, by retrying on Busy we open ourselves up to cascading failures, and we have not implemented work shedding yet. We still want to retry, since a single worker failing should be invisible to edge users.

One solution to this problem involves a few pieces:

  • Take into account the number of busy frames when doing peer selection. This will make any given node favor peers that are not busy
  • Continue retrying worker busy errors elsewhere.
  • On any given sub channel, if the majority of its peers are busy, start shedding the busy errors back to the client / edge. At that point we've run out of capacity and the edge will have to back off for the network to recover (see the sketch below).
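
A minimal sketch of what busy-aware peer selection and majority-busy shedding could look like. The names (`busyCount`, `selectPeer`, `shouldShed`) and the scoring formula are illustrative assumptions, not the current peer selection code:

```js
// Hypothetical sketch: score-based peer selection that penalizes peers
// which have recently returned Busy frames.
function selectPeer(peers) {
    var best = null;
    var bestScore = -Infinity;
    for (var i = 0; i < peers.length; i++) {
        var peer = peers[i];
        // Random tie-breaking, pushed down by recent Busy frames.
        var score = Math.random() - peer.busyCount;
        if (score > bestScore) {
            bestScore = score;
            best = peer;
        }
    }
    return best;
}

// Shed work back to the client / edge once a majority of the sub channel's
// peers are busy; below that threshold, keep retrying elsewhere.
function shouldShed(peers) {
    var busy = peers.filter(function isBusy(peer) {
        return peer.busyCount > 0;
    }).length;
    return busy * 2 > peers.length;
}
```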

Idea: for a subchannel, if there are any other peers that are "not recently
busy", transform busy frames into retriable error frames (declined) at both
ingress and egress. Only transport a busy frame if the entire downstream
cluster is busy.

This involves tracking a time decaying busy score on each peer for both
load balancing and saturation detection.
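
One way a time-decaying busy score per peer could work is sketched below; the half-life, field names, and threshold are illustrative assumptions rather than an agreed design:

```js
// Hypothetical exponentially decaying busy score kept per peer.
// Each Busy frame bumps the score; the score halves every HALF_LIFE_MS.
var HALF_LIFE_MS = 30 * 1000; // assumption: 30s half-life

function BusyScore() {
    this.value = 0;
    this.updatedAt = Date.now();
}

BusyScore.prototype.decay = function decay(now) {
    var elapsed = now - this.updatedAt;
    this.value *= Math.pow(0.5, elapsed / HALF_LIFE_MS);
    this.updatedAt = now;
};

BusyScore.prototype.onBusy = function onBusy() {
    this.decay(Date.now());
    this.value += 1;
};

// "Recently busy" check, usable for both load balancing and deciding
// whether to forward a Busy frame or rewrite it as a retriable Declined.
BusyScore.prototype.isRecentlyBusy = function isRecentlyBusy(threshold) {
    this.decay(Date.now());
    return this.value >= threshold;
};
```

With a score like this, the transform described above would forward a Busy frame only when every peer on the sub channel is "recently busy", and otherwise rewrite it as Declined so the caller retries elsewhere.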

Busy is still a signal for exponential backoff. We should still pursue a change-in-latency signal for flow control, which should help us avoid busy frames in most cases.
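
For reference, treating Busy as a backoff signal on the retrying side could look like a standard exponential backoff with jitter; this is a generic sketch with made-up constants, not the current retry implementation:

```js
// Generic exponential backoff with full jitter for Busy responses.
// BASE_MS and MAX_MS are illustrative values, not agreed constants.
var BASE_MS = 10;
var MAX_MS = 1000;

function busyBackoffDelay(attempt) {
    var cap = Math.min(MAX_MS, BASE_MS * Math.pow(2, attempt));
    return Math.random() * cap; // full jitter
}
```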

@kriskowal that does not help if the ingress is rate limited. Clients still see Busy, and the ingress does not know what the health is.

This needs to be applied to the hyperbahn client itself. However, transforming Busy into Declined is going to be confusing from a metrics point of view.

Maybe the total rate limiter should return Unhealthy:

https://github.com/uber/tchannel/blob/master/node/errors.js#L775

Busy is retriable, same as Declined. Unhealthy is not.

We should probably retry on Unhealthy.
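
The retriability rules being discussed, written as a hypothetical table; this is not the actual errors.js implementation linked above, just a summary of the thread:

```js
// Hypothetical retriability table for the error codes discussed above.
var RETRYABLE_BY_CODE = {
    Busy: true,      // worker saturated, try another peer
    Declined: true,  // circuit breaker declined the work, also retriable
    Unhealthy: false // terminal today; the suggestion above would flip this
};

function shouldRetry(codeName) {
    return RETRYABLE_BY_CODE[codeName] === true;
}
```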

Brace yourself for weirdness.

The circuit breaker sends Declined errors when it is Unhealthy. Declined can be retried.

I added the Unhealthy error type to the protocol in anticipation of retry-on-unhealthy having bad consequences. The circuit breaker doesn’t use it right now. We could remove it and pretend it never happened.