fiaas/k8s

Client retries for API-server errors


We've been seeing issues where the API-server returns errors when fiaas-deploy-daemon is deploying things:

  1. Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces
  2. Errors deploying ingresses during deploys of ingress-controllers by the cluster operators

It seems like having some retry-with-backoff in these situations would be helpful. One open question is what level this should live at, but I think the simplest implementation is in the HTTP client used here: with requests, it can be as simple as supplying a retry config that enumerates the status codes to retry on. Something like this, in k8s.client:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry on rate-limiting and transient server-side errors
retry_statuses = [requests.codes.too_many_requests,
                  requests.codes.internal_server_error,
                  requests.codes.bad_gateway,
                  requests.codes.service_unavailable,
                  requests.codes.gateway_timeout]
# method_whitelist=False means retry on all HTTP methods
# (the parameter was later renamed allowed_methods in newer urllib3)
retries = Retry(total=10, backoff_factor=1, status_forcelist=retry_statuses, method_whitelist=False)
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

This will retry on the listed statuses, for all HTTP methods, up to 10 times with exponential backoff:
0 before the first retry, then 2s, 4s, 8s and so on, capped at the default maximum of 120s.
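For reference, a rough sketch of the sleep schedule that configuration produces, assuming urllib3's documented formula of backoff_factor * 2 ** (retry - 1) capped at its default 120s maximum, and no sleep before the first retry on older urllib3 releases:

backoff_factor = 1
for retry in range(1, 11):
    # urllib3 sleeps backoff_factor * 2**(retry - 1) between attempts, capped at
    # 120s by default; older releases skip the sleep before the first retry,
    # which gives the 0, 2, 4, 8, ... sequence described above.
    sleep = 0 if retry == 1 else min(backoff_factor * 2 ** (retry - 1), 120)
    print("retry %d: sleep %ds" % (retry, sleep))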

Does this seem reasonable?

I think this makes a lot of sense, and would be useful to have in the client. There is a document essentially suggesting this behavior (exponential backoff) for many 500 range / server failure modes: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#error-codes

For the error types you've included in the example above, I think it is fine to implement the retry behavior in the client.
On a side note, there is already some retry behavior for 409 Conflict responses in fiaas-deploy-daemon, but it is implemented on top of the client calls. That error is a client-side concurrency error though, so it probably makes sense to keep handling it explicitly (a rough sketch of that pattern follows).
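For context, the on-top-of-the-client pattern looks roughly like this. This is a hypothetical sketch, not the actual fiaas-deploy-daemon code; the function and parameter names are made up:

import time
import requests

def call_with_conflict_retry(operation, attempts=3):
    # Hypothetical helper: retry an operation a few times when the API server
    # answers 409 Conflict, e.g. because the resourceVersion changed underneath us.
    for attempt in range(attempts):
        try:
            return operation()
        except requests.HTTPError as e:
            conflict = e.response is not None and e.response.status_code == requests.codes.conflict
            if conflict and attempt + 1 < attempts:
                time.sleep(2 ** attempt)
                continue
            raise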

Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces

The document I linked above suggests that clients "Read the Retry-After HTTP header from the response, and wait at least that long before retrying" when being rate limited with 429 StatusTooManyRequests.

OK. From the urllib3 docs, it looks like Retry respects Retry-After by default for 429 statuses; I'll double-check that the other options don't interfere with that default behaviour.
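If it turns out to be worth being explicit, urllib3's Retry takes a respect_retry_after_header flag (default True), and 429 is in its RETRY_AFTER_STATUS_CODES set. Something along these lines should keep the header-based wait while still applying the backoff for the other statuses; worth verifying against the urllib3 version pinned in the project:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=10,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504],
    method_whitelist=False,
    # Default in recent urllib3, spelled out here for clarity: wait at least
    # as long as the Retry-After header says before retrying 429/503 responses.
    respect_retry_after_header=True,
)
session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))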