hashicorp/terraform-provider-aws

25 max_retries is really long for a default


I've been debugging an issue for the past few days where operations on Lambda functions were hanging forever. The root cause eventually turned out to be a crappy corporate DNS resolver returning non-RFC-compliant replies that the Go net library didn't like. However, the hunt still turned up Terraform issues that significantly delayed me in getting to the root of the problem.

  1. A default max_retries of 25 seems like way too many. If each request takes any significant amount of time to fail, a resource can be stuck spinning for over an hour, especially because the SDK's default retry logic is exponential.

The retry logic from aws-sdk-go/aws/client/default_retryer is effectively 2**min(retryCount, 13) * randint(30) + 30 (in milliseconds). (Some of the constants change to scale up faster if the error is AWS telling the client to throttle.)
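A minimal Go sketch of that formula (my paraphrase, not the SDK's actual source; retryDelay is just an illustrative name, and the larger throttling constants are ignored):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retryDelay paraphrases the default retryer's backoff per the formula
// above: 2^min(retryCount, 13) * rand(0..29) + 30, in milliseconds.
// (The real SDK swaps in larger constants when AWS signals throttling.)
func retryDelay(retryCount int) time.Duration {
	if retryCount > 13 {
		retryCount = 13 // scaling cap
	}
	ms := (1<<uint(retryCount))*rand.Intn(30) + 30
	return time.Duration(ms) * time.Millisecond
}

func main() {
	// Print a sample of the sleeps a single failing resource would incur.
	for n := 0; n < 25; n++ {
		fmt.Printf("attempt %2d: sleeping %v\n", n+1, retryDelay(n))
	}
}
```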

The start of that function is pretty gentle. Assuming an average of 15 on the rand call, it's not until the 8th attempt that you spend more than a second between calls. But after that it scales quickly, as exponentials are wont to do, and once retryCount hits the cap of 13 you spend an average of about 2 minutes between calls. Summing over all 25 retries, you can expect to wait an average of roughly 26 minutes on a failing request, plus whatever time it takes all 25 requests to fail (which can be substantial when the failure involves a request timeout).

Full sequence of average wait times between each request (in seconds):

[0.045, 0.06, 0.09, 0.15, 0.27, 
 0.51, 0.99, 1.95, 3.87, 7.71, 
15.39, 30.75, 61.47, 122.91, 122.91, 
122.91, 122.91, 122.91, 122.91, 122.91, 
122.91, 122.91, 122.91, 122.91, 122.91]
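Those averages (and the running totals behind the recommendations below) can be reproduced by substituting the average value of 15 for the rand call; a quick back-of-the-envelope check, not SDK code:

```go
package main

import "fmt"

func main() {
	total := 0.0
	for n := 0; n < 25; n++ {
		capped := n
		if capped > 13 {
			capped = 13 // scaling cap from the formula above
		}
		// Expected sleep in seconds: (2^capped * 15 + 30) ms.
		secs := (float64(uint64(1)<<uint(capped))*15 + 30) / 1000
		total += secs
		fmt.Printf("attempt %2d: %8.3fs (cumulative %7.1fs)\n", n+1, secs, total)
	}
	// Final total: ~1598s of sleeping, i.e. the ~26 minutes quoted above.
}
```

The cumulative column is also where the ~15s, ~60s, and ~4 minute totals for the alternative defaults below come from.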

A max_retries of 10 (total ~15s sleeping) to 12 (total ~60s sleeping) seems like a better default. Maybe 14 tops, where we can expect Terraform to wait an average of about 4 minutes before bailing.

  2. The above might be okay, but the provider doesn't configure the SDK to log why it's retrying, so you have to wait out the full timeout before the underlying error bubbles up.
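For what it's worth, the Go SDK has built-in debug log levels for exactly this; a sketch of a session configured to surface retry reasons alongside a lower retry ceiling (aws-sdk-go v1 API, not what the provider currently does):

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	// Cap retries lower and log each retry plus the error that caused it.
	cfg := aws.NewConfig().
		WithMaxRetries(12).
		WithLogLevel(aws.LogDebugWithRequestRetries | aws.LogDebugWithRequestErrors)
	sess := session.Must(session.NewSession(cfg))
	_ = sess // hand this session to service clients as usual
}
```

The provider would need to plumb something like this through its own configuration for the retry reasons to show up in Terraform's debug output.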

Terraform Version

0.9.11 and 0.10.0-rc1 (w/ aws 1.2)

Example HCL

I used the HCL verbatim from https://www.terraform.io/docs/providers/aws/r/lambda_function.html, but any old HCL will do.

Steps to Reproduce

  1. Run terraform refresh or terraform apply.
  2. While Terraform is running, do something to induce connection errors. For instance, I pointed my nameserver at 127.0.0.1.
bflad commented

The fix to lower the threshold for DNS resolution errors (#4459) was previously released in version 1.18.0 of the AWS provider and has been available in all releases since. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

In general, though, the provider's max_retries argument can be tuned as appropriate for your environments. We have found over the years that operators of larger environments appreciate the higher default value, and in many scenarios the AWS Go SDK's default retry handling is correct (e.g. retrying throttling errors).

If there are specific cases where the retry logic is not working as expected, please feel free to create a new Bug Report issue with the relevant details filled out and we will triage further. Thanks. 👍

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!