Retry with delay and backoff for some HTTP responses
Kalmalyzer opened this issue · 2 comments
I am using desync against Google Cloud Storage (GCS), with the GCS REST API configured as a chunk store. When I upload 100MB of test data, the GCS backend responds with a 502 Bad Gateway partway through the upload.
I'm not sure why; I think it is because desync attempts to push more data than GCS can handle via a single TCP connection. Regardless of the reason, would it make sense to have a retry mechanism for some failures? (Most, if not all, 5xx error codes are potentially transient and should be retryable.)
For example, https://github.com/hashicorp/go-retryablehttp implements a default backoff model in which the delay grows with each attempt (exponentially, bounded by a configurable minimum and maximum wait).
(Perhaps it would be a good idea to swap out raw net/http with go-retryablehttp rather than fleshing out a homemade retry mechanism?)
One implementation of this is available in #161.
There is now an optional retry mechanism in place for HTTP/HTTPS stores.
The code base already retried network-level failures (DNS lookup failures, TCP connection establishment failures, SSL handshake failures, TCP streams breaking mid-request, ...). #161 extends that retry to apply to 5xx-class HTTP responses as well.
Retries are instant by default, but the user can activate a linear backoff (after error N, the corresponding goroutine will sleep for N*base_interval nanoseconds before retrying).
With the retry mechanism active, I have successfully used Google Cloud Storage's REST API for storing chunks (100MB test suite, tests run a half-dozen times). The intermittent 502s have not resulted in any application-level failures in desync so far.
Closing!