fiaas/k8s

Client retries for API-server errors


We've been seeing issues where the API-server returns errors when fiaas-deploy-daemon is deploying things:

  1. Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces
  2. Errors deploying ingresses during deploys of ingress-controllers by the cluster operators

It seems like having some retry-with-backoff in these situations would be helpful. One open question is what level this should live at, but I think the simplest implementation is in the HTTP client used here: with requests, it can be as simple as supplying a retry config that enumerates the status codes to retry on. Something like this, in k8s.client:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry on rate-limiting and transient server-side errors
retry_statuses = [requests.codes.too_many_requests,
                  requests.codes.internal_server_error,
                  requests.codes.bad_gateway,
                  requests.codes.service_unavailable,
                  requests.codes.gateway_timeout]
# method_whitelist=False means retry on all HTTP methods
# (the parameter was later renamed allowed_methods in newer urllib3)
retries = Retry(total=10, backoff_factor=1, status_forcelist=retry_statuses, method_whitelist=False)
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

This will retry on the listed statuses, for all HTTP methods, up to 10 times with exponential backoff:
0 before the first retry, then 2s, 4s, 8s and so on, capped at the default maximum of 120s.
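For reference, a rough sketch of the sleep schedule that configuration produces, assuming urllib3's documented formula of backoff_factor * 2 ** (retry - 1) capped at its default 120s maximum, and no sleep before the first retry on older urllib3 releases:

backoff_factor = 1
for retry in range(1, 11):
    # urllib3 sleeps backoff_factor * 2**(retry - 1) between attempts, capped at
    # 120s by default; older releases skip the sleep before the first retry,
    # which gives the 0, 2, 4, 8, ... sequence described above.
    sleep = 0 if retry == 1 else min(backoff_factor * 2 ** (retry - 1), 120)
    print("retry %d: sleep %ds" % (retry, sleep))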

Does this seem reasonable?

I think this makes a lot of sense, and would be useful to have in the client. There is a document essentially suggesting this behavior (exponential backoff) for many 500 range / server failure modes: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#error-codes

For the error types you've included in the example above, I think it is fine to implement the retry behavior in the client.
On a side note, there is already some retry behavior for 409 Conflict responses in fiaas-deploy-daemon, but it is implemented on top of the client calls. That error is a client-side concurrency error though, so it probably makes sense to keep handling it explicitly (a rough sketch of that pattern follows).
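For context, the on-top-of-the-client pattern looks roughly like this. This is a hypothetical sketch, not the actual fiaas-deploy-daemon code; the function and parameter names are made up:

import time
import requests

def call_with_conflict_retry(operation, attempts=3):
    # Hypothetical helper: retry an operation a few times when the API server
    # answers 409 Conflict, e.g. because the resourceVersion changed underneath us.
    for attempt in range(attempts):
        try:
            return operation()
        except requests.HTTPError as e:
            conflict = e.response is not None and e.response.status_code == requests.codes.conflict
            if conflict and attempt + 1 < attempts:
                time.sleep(2 ** attempt)
                continue
            raise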

Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces

The document I linked above suggests that clients "Read the Retry-After HTTP header from the response, and wait at least that long before retrying" when being rate limited with 429 StatusTooManyRequests.

OK. From the urllib3 docs, it looks like Retry respects Retry-After by default for 429 statuses; I'll double-check that the other options don't interfere with that default behaviour.
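If it turns out to be worth being explicit, urllib3's Retry takes a respect_retry_after_header flag (default True), and 429 is in its RETRY_AFTER_STATUS_CODES set. Something along these lines should keep the header-based wait while still applying the backoff for the other statuses; worth verifying against the urllib3 version pinned in the project:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=10,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    method_whitelist=False,
    # Default in recent urllib3, spelled out here for clarity: wait at least
    # as long as the Retry-After header says before retrying 429/503 responses.
    respect_retry_after_header=True,
)
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))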