mailgun/mailgun-go

Handling rate limits

phillbaker opened this issue · 7 comments

Hi there, thanks for the work on this library.

I support an upstream library (https://github.com/phillbaker/terraform-provider-mailgunv3/) and there has been a question about handling rate-limits (http 429s) (phillbaker/terraform-provider-mailgunv3#15).

Exponential backoff seems to be the path forward - but is that something that makes sense to handle here in the client or in downstream libraries?

Thanks!

@phillbaker this is a great question. I just finished work on a rate limiting system at Mailgun, but most of our APIs have yet to utilize it. It will eventually provide clients with the following headers.

X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
X-RateLimit-Reset: 1372700873

I think it could make sense for this library to provide this, depending on the expected behavior.

For instance, if a call to GetDomains() is rate limited, should the call to GetDomains() block until the X-RateLimit-Reset time is reached and then retry? It might not be clear to the caller why GetDomains() is blocking for such a long time. Perhaps we could make it an opt-in feature, so the caller knows the call will be rate limited and might block for some time.

Thoughts?

Maybe the initialization could be changed (if it is changed anyway with v4) to have a parameter for rate limit handling?

Alternatively, a fixed var ErrRateLimitExceeded = fmt.Errorf("rate limit exceeded") that users of the client library can check for could be used, along with internal state in the Mailgun client holding the reset time and a function WaitForWaitLimitReset().

@phillbaker I'm ok with either solution. @mbanzon's ErrRateLimitExceeded would be much more explicit (which I generally like) but would require more boilerplate. It would be VERY explicit that the call was rate limited.

Code might look something like this.

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*30)
	defer cancel()

	// For iterators
	it := mg.ListDomains(nil)
	for attempts := 0; attempts < 3; attempts++ {
		var page []mailgun.Domain
		for it.Next(ctx, &page) {
			for k, v := range page {
				fmt.Printf("\tDomain: [%d] - %+v\n", k, v)
			}
		}

		if it.Err() != nil {
			if mailgun.IsRateLimitErr(it.Err()) {
				mailgun.WaitForWaitLimitReset(it.Err())
				continue
			}
			log.Fatal(it.Err())
		}
		break
	}

	// Regular network calls
	var msg mailgun.StoredMessage
	var err error
	for attempts := 0; attempts < 3; attempts++ {
		msg, err = mg.GetStoredMessage(ctx, "http://foo")
		if err != nil {
			if mailgun.IsRateLimitErr(err) {
				mailgun.WaitForWaitLimitReset(err)
				continue
			}
			log.Fatal(err)
		}
		break
	}

But making everyone use this boilerplate kinda feels wrong, since every call should be handling the rate limit error. People are gonna start wrapping our API in retry helper functions. Doesn't feel right. I'm not sure I like either solution, to be honest.

Kubernetes does this thing where all of their network operation calls follow the signature func(ctx, arg1, arg2, *objectToGet) error, such that if the user wants to support retries or rate limited fallback they can do this.

	var msg mailgun.StoredMessage
	retry(3, func() error {
		return mg.GetStoredMessage(ctx, "http://url", &msg)
	})

	it := mg.ListDomains(nil)

	var page []mailgun.Domain
	for retryIter(3, func() error { return it.Next(ctx, &page) }) != mailgun.EndIterationErr {
		for k, v := range page {
			fmt.Printf("\tDomain: [%d] - %+v\n", k, v)
		}
	}

The helper functions retry() and retryIter() would handle ErrRateLimitExceeded and wait until reset, then try again up to 3 times. If the user doesn't care about rate limiting, or wants to handle it themselves, they have the option to.

This would require a complete change of the API. For the iterators the Next() would return an error instead of a bool and the client would need to check for EndIterationErr to know it's at the last page of results.

Not sure I like any of the solutions, I need to think about it some more.

@mbanzon @phillbaker What do you guys think?

I was curious about how the AWS SDK handles this and found this issue: aws/aws-sdk-go#2608

A few thoughts came to mind while reading this.

  1. There will always be someone that wants immediate feedback and no retry support.
  2. There will always be someone that wants to use their own retry system.
  3. Although ALL of Mailgun's network operations should eventually support rate limiting, and all API calls should eventually return X-RateLimit headers, I can't guarantee when this will occur, since our API internally is made up of several different microservices, each with their own SLA and rate limit requirements.

This means we will have some calls that will not support rate limit fallback on the API side. The library or library user will need to implement some sort of conservative rate limiting for API calls that do not currently return X-RateLimit headers. Perhaps golang.org/x/time/rate?

@thrawn01 thanks for all of the thoughts and digging - definitely some complexity here.

In a perfect world, I think this would be handled transparently by the library. However, given the complexity on the API side and the likelihood of users wanting to handle it themselves, maybe it's fine to close this as a "won't fix" for now and document a suggested Go retry pattern.

In the future, if there's a breaking API change scheduled, then the Kubernetes approach of allowing users to wrap calls with their own retry and error handling functions seems like a good way to go.

It is impossible to satisfy everyone 😏

Ideally, I'd rather have to write no code at all to handle this when using the library - checking error types and/or calling functions to wait for reset would annoy me and I'd rather quickly wrap that in my own functions to handle it.

If I consider the general problem of sending mail, this all kind of fades into the background. Generally, I assume that the mail should be delivered. I don't care how long it takes or how it is done. We send account confirmation and password reset emails, e.g., and those _need_ to arrive at some point; I don't really want the hassle of having to keep trying and trying.

This leaves one choice really: having the library block until it succeeds (network errors, e.g., should of course be handled by the user). This generally isn't a problem for me: all the handlers I have that send mail do it in a goroutine to avoid leaking information through call timing. Having such a goroutine run longer because of rate limiting isn't an issue (for me/us).

I think we should shelve this until v4.