Clarify conflicting guidance around how long to back off when being throttled

Question

Closed this issue 5 years ago · 8 comments

Category

Question
Typo
Bug
Additional article idea

There are a few official sources that provide advice on how to handle throttling (i.e., HTTP 429 responses) from SharePoint Online:

Avoid getting throttled or blocked in SharePoint Online shows some code snippets that suggest using an exponential backoff with a base delay of 30 seconds.

This is in contrast to Updated Guidance around SharePoint Web Service Identification and Throttling which suggests "the correct behavior is to respect the 429 response and retry the request based on the retry-value in the 429 message." To date, we've seen that the Retry-After value in the 429 responses coming from SharePoint is always 120.

The key questions are:

1. Should we use the Retry-After value or a hard-coded value? I'm guessing the Retry-After value would be preferable, as it allows SPO to alter the backoff period appropriately as the throttling limits change.
2. If we use Retry-After, should we still use an exponential backoff for continued failures, or just always use whatever value was supplied in the Retry-After header?
3. Our application is scaled out across many threads/machines, all of which may be concurrently calling SharePoint Online via the CSOM. If we receive a 429 response on one thread, should we try to stop all other threads from calling SPO for the specified period as well?

As an aside, the SharePoint Online Throttling Core Sample suggests the same as above, but has an inconsistent comment: the code says 30 seconds, but the comment says 10 seconds. I presume this is a typo and it's meant to be 30.

Thanks :)

(copied from SharePoint/PnP-Guidance#193)

Answer 1 · 2018-03-12T11:08:48.000Z

+1 for clearer guidance on whether the retry-after header should be respected. Personally I think it should be, and the PnP code samples and guidance articles need updating to reflect this, unless there is a compelling reason not to, in which case that reason needs to be stated in the guidance docs.

@ahofman about point 3, I would definitely recommend stopping all processing for a tenant across all threads. I work on an application that can be deployed to multiple sites in a SharePoint tenant and receives various event notifications, which are placed onto a queue for processing. We process several messages at once and we use the async/await pattern, which can result in several concurrent calls per message. Because of this, when we received a throttle response, I found that the recommended guidance did not provide a suitable back-off mechanism for this scenario. Because the back-off was not coordinated across threads/messages, there was effectively no back-off.

The solution I am currently implementing is this: when a throttling response is received, write a single tenant throttling record to Azure table storage with an expiry time (the last thread to write the record sets the expiry). The record prevents any new messages from being processed, and a cancellation token is used to cancel any currently running threads. New messages check the expiry time and remove the record if it has passed, allowing processing to continue.
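For illustration, the tenant-wide record described above might be sketched roughly as follows. This is not the poster's implementation: the class `TenantThrottleStore`, its method names, and the in-memory dictionary standing in for Azure table storage are all hypothetical, and the cancellation of in-flight work is left out.

```python
import threading
import time
from typing import Callable, Dict


class TenantThrottleStore:
    """Hypothetical in-memory stand-in for a shared per-tenant throttle record."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._expiry_by_tenant: Dict[str, float] = {}
        self._clock = clock
        self._lock = threading.Lock()

    def record_throttle(self, tenant: str, retry_after_seconds: float) -> None:
        # Last writer wins: the most recent 429 sets the expiry time.
        with self._lock:
            self._expiry_by_tenant[tenant] = self._clock() + retry_after_seconds

    def is_throttled(self, tenant: str) -> bool:
        # New work checks the record first and removes it once it has expired.
        with self._lock:
            expiry = self._expiry_by_tenant.get(tenant)
            if expiry is None:
                return False
            if self._clock() >= expiry:
                del self._expiry_by_tenant[tenant]
                return False
            return True
```

A worker would call `is_throttled(tenant)` before picking up a message and `record_throttle(tenant, delay)` whenever it sees a 429. A real multi-machine deployment would need a store that all machines can reach, which this sketch only imitates.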

Answer 2 · 2019-07-10T11:52:20.000Z

@VesaJuvonen @andrewconnell You haven't answered yet. Can you please resolve this issue ASAP? This issue was opened 5 months ago. Please consider.

Answer 3 · 2019-07-10T14:28:12.000Z

The official guidance doc on SharePoint Online throttling states you should use the retry-after header response, as stated here (emphasis added):

"That's why it's so important for your CSOM or REST code to honor the retry-after header value; this lets your code run as fast as possible on any given day, and it lets your code back off 'just enough' if it hits throttling limits. The code samples later in this article show you how to use the retry-after header.
...
If you do run into throttling, we require leveraging the retry-after header to ensure minimum delay till the throttle is removed."

Retry after is the fastest way to handle being throttled because SharePoint Online dynamically determines the right time to try again. In other words, aggressive retries work against you because even though the calls fail, they still accrue against your usage limits. Following the retry header will ensure the shortest delay.
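As a rough sketch of what honoring the header can look like, the loop below waits for exactly the number of seconds the server supplies before trying again. It is transport-agnostic and not taken from CSOM or PnP: `send`, `call_with_retry_after`, `max_attempts`, and `default_delay` are hypothetical names, the 120-second default is an assumption based on the value observed in this thread, and only the integer-seconds form of the header is handled.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, response headers)


def call_with_retry_after(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_delay: float = 120.0,  # assumption: used only if the header is missing
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() and, on HTTP 429, wait for the Retry-After period before retrying."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, headers = send()
    attempts = 1
    while status == 429 and attempts < max_attempts:
        # Header names are case-insensitive, so normalize before the lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        raw = lowered.get("retry-after", "").strip()
        delay = float(raw) if raw.isascii() and raw.isdigit() else default_delay
        sleep(delay)  # use the server's value as-is, not as a base for a calculation
        status, headers = send()
        attempts += 1
    return status, headers
```

The `sleep` parameter exists only so the waiting can be replaced in tests. If the last attempt is still throttled, the 429 response is returned to the caller to handle.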

Directly answering the OP's questions, @ahofman said:

"Should we use the Retry-After value or a hard-coded value? I'm guessing the Retry-After value would be preferable, as it allows SPO to alter the backoff period appropriately as the throttling limits change."

Use the retry-after value.

"If we use Retry-After, should we still use an exponential backoff for continued failures, or just always use whatever value was supplied in the Retry-After header?"

Use the retry-after value; you don't need to use it as the basis for a calculation.

"Our application is scaled out across many threads/machines, all of which may be concurrently calling SharePoint Online via the CSOM. If we receive a 429 response on one thread, should we try to stop all other threads from calling SPO for the specified period as well?"

That would be your responsibility to control the requests. SPO isn't seeing your request as coming from multiple scaled out threads/machines, it's seeing it as one. IOW, if one gets a 429, all will, so you need to incorporate it into your logic.

As to the issue with the PnP sample solution referenced, it appears to be using a hard-coded backoff, which doesn't correspond to the current guidance. That sample was published before the throttling limits were put in place and before the updated guidance. From my POV, it doesn't reflect the current guidance. I'll submit an issue to get that updated.

Answer 4 · 2019-07-10T15:13:51.000Z

@andrewconnell Just a small clarification here. As said here, the ExecuteQueryWithRetry implementation will be included in the Office DevPnP NuGet package.

I've looked at the code: previously it made use of the Retry-After header, but recently they've changed it to an exponential backoff approach, with a comment like this: "Retry-After seems to default to a fixed 120 seconds in most cases, let's revert to previous logic". Please clarify this.

Answer 5 · 2019-07-10T15:35:14.000Z

The document you are referencing pre-dates the current guidance of using the retry after header... thanks for pointing it out... I'll get it addressed.

As for the PnP Core (I can't find what you're referencing as you're pointing to the entire repo, not a specific commit/line of code), I can't comment on the decision. I can just point to the official guidance. I'll look into the fixed 120s point though and try to get the people responsible for this to chime in on this thread...

Answer 6 · 2019-07-10T16:10:36.000Z

Thanks for such a quick reply & I am very grateful to you for this :)

"As for the PnP Core (I can't find what you're referencing as you're pointing to the entire repo, not a specific commit/line of code)"

Just FYI, the code (commit) I pointed to is in ClientContextExtensions.cs at Line 195 of Office DevPnP. And this code (commit) was made on 18 Dec 2018.

Answer 7 · 2019-07-11T14:37:46.000Z

OK... spoke with the folks at MSFT (thanks @JeremyKelley!) who are involved with the throttling guidance so I can finally close this discussion :)

The official word is what the docs say: use the retry-after header value. Anything beyond this, that you add, is just extra on your side so you're effectively going the "extra mile".

Today, the retry-after header has a static value of 120s, but that will/may change in the future. Your best bet is to use the value returned today & in the future.
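Since the value may change, code that reads it is probably better off not assuming 120. The sketch below is one way to do that: it accepts either form the HTTP specification allows for Retry-After (a number of seconds or an HTTP date) and falls back only when the header is missing or unreadable. The function name `retry_after_seconds` and the `fallback` of 120 seconds are assumptions for illustration; nothing in this thread says SharePoint Online sends the date form.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    value: Optional[str],
    now: Optional[datetime] = None,
    fallback: float = 120.0,  # assumption: the static value observed today
) -> float:
    """Convert a Retry-After header value into a delay in seconds."""
    if value is None:
        return fallback
    text = value.strip()
    if text.isascii() and text.isdigit():
        # Delay given directly as a whole number of seconds.
        return float(text)
    try:
        when = parsedate_to_datetime(text)  # HTTP-date form
    except (TypeError, ValueError):
        return fallback
    if when.tzinfo is None:
        # Treat a date without a usable offset as UTC.
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0.0, (when - current).total_seconds())
```

With today's behavior, `retry_after_seconds("120")` returns `120.0`; if the service later returns a different number, the same call site picks it up without a code change.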

I am intentionally not going to comment on the PnP Sites Core decision to change their logic. If you have questions about that project, you should post your question to their repo.
