kontent-ai/java-packages

Improve retry policy for requests

tobiaskamenicky opened this issue · 3 comments

Motivation

From February 1st, 2020 Delivery API will start enforcing rate limits to ensure fair usage by as many clients as possible. These rate limits should be high enough to work for most clients by default. However, in particular cases Delivery API might respond with 429 Too many requests. These responses will also contain the Retry-After header indicating how long the client should wait before making a follow-up request.

Proposed solution

Implement a retry policy that handles 429 Too many requests and retries the HTTP call using information from an optional Retry-After header. It can contain a date after which to retry or the seconds to delay after the response is received. We recommend a retry policy with exponential backoff and a jitter strategy to overcome peaks of similar retries coming from many clients. The retry policy should handle the following status codes:

  • 408 Request Timeout
  • 429 Too many requests
  • 500 Internal Server Error
  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout

Please note that both 429 Too many requests and 503 Service Unavailable responses might contain an optional Retry-After header.

Additional context

The current implementation performs a binary exponential backoff:
https://github.com/Kentico/delivery-sdk-java/blob/1d4e94f8a6cabc6cb4275e370700b9db98b1c278/delivery/src/main/java/kentico/kontent/delivery/AsyncDeliveryClient.java#L507

Where counter is based on the number of attempts. The max number of retry attempts is configurable and defaults to 3.

This certainly could be updated to using a random with a backoff if that is the standard for all the SDKs.

As far as a MaxCumulativeWaitTime set to default to 30 seconds... I'm open to discussing that, but in my experience with my previous employers use cases, this would be very detrimental and could hinder consumers from failing fast.

I concur that if a Retry-After header is sent, that it should be respected. The current implementation does not read that header.

I updated the description of the issue. The main motivation is to prepare SDKs for a rate limitation that is planned in Delivery API. Therefore, retry policy (regardless the implementation) needs to handle the mentioned status codes and respect "Retry-After" header for 429 and 503.

Adam, thank you for your feedback. As Tobias already mentioned, the main goal is to prepare clients for changes in Delivery API that will take effect on February 1st, 2020. To do so each SDK should implement a retry policy that supports the defined range of HTTP status codes and the Retry-After header. SDK authors are free to choose the implementation or handle additional error conditions as the .NET SDK already does.

Regarding the waiting time, it is a difficult to make as different users have different expectations. We see two major use cases; integrations and web applications. While integrations can usually take as much time as they need, web applications might fail fast because they can display an error message. And as you pointed out, it might be better to display an error message (or take another action) rather than provide no response at all. However, if we had decided to favor web applications, what would be the definition of failing fast? Would it be 1 second or 1 minute? It seems there is no right value and the result is a compromise.

We believe that developers should customize the retry policy for each application and we could only provide them with a few hints. Also, we are currently working on the best practices for developers who use the Delivery API, and this is definitely a topic that will be included.

Could you please share more information about your experience with previous employers' use cases?