App-vNext/Polly

Decorrelated jitter does not jitter well

george-polevoy opened this issue ยท 41 comments

Summary: Jitter the retry delays to achieve low concurrency



I'm trying to use the suggested "decorrelated jitter" strategy

jittering

In my experiment, the suggested decorrelated jitter does not achieve what it supposed to.

First, new Random() is not quite random, when it comes to reducing concurrency.
You can get multiple instances with the same seed, if many retries are issued simultaneously.

Following code achieves much lower concurrency and better distribution.
In the attached screenshot you can see that the 'advanced' graph has much lower maximum and smoother distribution, which is highly desired to avoid concurrency spikes.

IEnumerable<TimeSpan> DecorrelatedExponent()
{
    var r = new Random(Guid.NewGuid().GetHashCode());
    for (var softAttemptCount = 0.0; softAttemptCount < 4; ) {
        softAttemptCount += r.NextDouble() * 2;
        yield return TimeSpan.FromSeconds(
            Math.Pow(2, softAttemptCount) * random.NextDouble());
    }
}

Thank you @george-polevoy for diving deeper on this than I had time to and for sharing that with the community, this is super interesting. The decorrelated Jitter strategy quoted came from that AWS team research, definitely open to investigating different options.

First, could you say what the axes on the graphs represent? (A guess ... the number of instances (vertical) of a specific delay of some-or-other timespan (horizontal) when averaged over 100000 iterations of the algorithm ... but you can state precisely? )

Second, would you be happy to share the test code which generated the results? (maybe in a ZIP or github gist). It would be great for you/ I/ others to be able to validate, and to build further on your results and explore how tweaking the algorithms affects outcomes.

Following code achieves much better distribution

I am definitely seeing in your graphs that the green line is much more smoothly distributed than the yellow line as generated in the tests. (EDIT: But it would be good to, eg, dig into the code as @grant-d suggests, to verify why)

Following code achieves much lower latency

I want to be sure I have understood what we mean by this. Do we mean that the average delay that the algorithm chooses to introduce before retrying (average perhaps in some sense of area under graph/percentiles), is lower? Or do we mean (for example) that the average overall time to achieve success is less, for each algorithm pitted against a failed system which recovers after some random amount of time?

It would be really interesting to build on your experiments (if they do not cover this already) to be able to analyse the dimensions of overall latency to achieve success, combined with number of calls made to achieve success.

Thank you again for sharing this with the community!

George, would you mind trying your experiment with the Random taken out of the method and rather declared as a private static readonly Random on the class? That way we share it across concurrent retries so won't get the same seed (Datetime.Now.Ticks) . It would be interesting to see how the graph compares.
Also, see my abstraction in #526. If you are perhaps willing to use it in your test harness? Note that it too should be instantiated as a shared singleton, with each retry calling the factory Create method on it. See the sample code.

@reisenberger Dylan, sure there was a mistake. It was concurrency that is improved, not latency, I've updated the initial issue.

The vertical axis represents number of experiments. Virtual massive retry event of simultaneous 100000 requests could result in concurrency splash, which is illustrated. If I try to graph fewer experiments, the graph is just too noisy. I understand that the number of experiments is irrelevant, so it would be usefult to factor it out, and display just some normalized probability distribution.

I will try to come up with complete benchmark code, so others could explore and experiment.

TLDR; I agree with George - the guidance in the wiki is not optimal. It can lead to bad performance characteristics. Which is further evidence for my suggestion to include a formal jitter in the box (#526).

@george-polevoy, my concerns are not just a preference; my suggestion is that it may be the case that instantiating Random inside the method is also somewhat of a bug, perhaps leading to the correlations you're seeing in the decorrelated.jitter results. Random should be an external singleton.

There's probably further correlation due to the use of Guids as a source of (seed) randomness. Guids are built to be unique not random (there is a difference: the monotonic sequence 1,2,3,... is guaranteed unique but certainly not random).

Both your advanced and decorrelated.jitter test results will probably improve further if you move Random outside the method (we'll need testing to prove the degree of improvement)

[ThreadStatic] // Added, per George's observation re: thread safety
private static readonly Random r = new Random(); // Shared Random between all concurrent executions

IEnumerable<TimeSpan> DecorrelatedExponent()
{
    // code that uses 'r'
}

@grant_d Random is not thread safe, I'm afraid, and I'm using a random seed from Guid.GetHashCode, which is an alternative to using a thread local variable.

I'm using a random seed from Guid.GetHashCode

Once again, your source of randomness is not random - see my notes above, and see the first code snippet here, where your pattern corresponds to the warning he has in that code.

Guid.NewGuid().GetHashCode() is not random. Uniqueness and randomness are different things:
1,2,3,4,5,6,7... Unique, not Random
1,2,1,1,1,2,1,2,2,2... Random, not very Unique

Guid.NewGuid() is, in principle, more similar to the former than the latter.
Taking a hashcode of it makes it even less random, since you've compressed the domain to 32 bits.

(You are correct that I forgot the [ThreadStatic] attribute. Code sample updated)

Thanks to both of you @george-polevoy and @grant-d for progressing this! I haven't had much time on this thread but note that:

So: Should we just be following the recommendations here and here?

(V happy for further clarifications/corrections, as the above was just based on some brief reading!)

Thanks @reisenberger, I have used the various thread-static approaches but the locking one is interesting; I would not have expected to see advice to lock a primitive like Random in high-concurrency scenarios but in some ways it makes sense.

The [ThreadStatic] approach will work fine for the tests but if we ship these changes then we should noodle on which approach to use for the long-term.

Circling back to our use case: given we are calculating randomised delays of far greater order of magnitude than any delay contention due to fine-grained locking, I guess we can safely follow the advice and not care about the impact of locking.

@grant-d Randomness is not an issue here. I will share code for you to verify. My computations are now single threaded and deterministic, using a fixed random seed and a shared Random instance, not using Polly. Also, system clock issues can be ruled out, because this is just purely math integration based on deterministic pseudorandom sequence. The entire computation does not depend on runtime issues of NET, it's computed offline. It's just a math routine. I'm computing the delay sequence function itself using virtual regular time intervals.

I see the problem with your code.

In addition to min(cap, ...) that AWS team suggested in their blog, you've also added max(seed, ...).

Say we have 100 requests failed at time t0.
Seed = 1s, and max=3s.

What is the probability of scheduling a request at time t0 + 1s? It's 1/3, because all of the values lower than t0+1s clamp. So you've got aprox. 33 requests at exact timing: t0 + 1s.
All of the other requests will be distributed uniformly. The next clamp will appear at t0 + 2s, and concurrency will be lower (1/3^2 = 1/9 = aprox. 11) at exact timing t0 + 2s.

That completely explains the fist two spikes on the graph, which are clearly seen.

I've examined the code on the AWS blog, and they have slightly different thing. They use random_between, which is not max(something, random.NextDouble).
Random between should be a linear mapping, not limiting with Max.
Replace it with RandomBetween(x, y) = x + r.NextDouble() * (y-x) and the problem is gone.

I will follow up with the modelling code i'm using.

Agree - my code has a slightly different implementation to the wiki too, though I am still refining it. I am curious to compare it with yours, please do post it as soon as you are ready.

[Edit: This is stale - see related PR for correct algorithm]

For reference, here's my code:

            double range = MaxDelay.TotalMilliseconds - MinDelay.TotalMilliseconds; // Range
            for (int i = 0; i < delays.Length; i++) 
            { 
                double ms = range * _random.NextDouble(); // Ceiling
                ms += MinDelay.TotalMilliseconds; // Floor 

                delays[i] = TimeSpan.FromMilliseconds(ms); 
            }

Thanks both @george-polevoy @grant-d ! Would be great to see graphs of the distributions of generated by the revised algorithms.

Again, thank you both @george-polevoy @grant-d .

@george-polevoy I think your analysis here is exactly correct. Thank you for spotting the (now obvious, yes) flaw in the algorithm here at time of writing.

@grant-d The version in your PR (for which many thanks) also matches this analysis and the AWS algorithm, I believe. Thanks for all the work on this and the PR! (EDIT: PR suggestions will follow.)

The upper-end clamp

Analogous to the lower-end clamp problem which you have eliminated, @george-polevoy , your contribution led me to reason again about the upper-end clamp. There is of course a potential upper-end clamp after enough iterations, because of the ceiling imposed at MaxDelay.TotalMilliseconds. To have a MaxDelay seems reasonable, however, focusing pragmatically on our use case, rather than necessarily a 'flaw' in the algorithm - the concept of a "maximum delay" should make intuitive sense for users as a feature; and leaving the series unbounded would not necessarily give users a better experience! (think very long delays due to exponentiation). However, if MaxDelay is set too low in relation to seed for the number of retries configured/necessary, then retry intervals will certainly bunch (once MaxDelay is hit, there is a >=66% of the next interval also being MaxDelay, right?). Which brings the question: How can we best communicate the upper-end characteristics of the configuration parameters to users?

I drew users' attention to this in my existing guidance by emphasizing (from my own experience) that max should not be too low. However, do you think we can formalise this guidance any more precisely? I did some work in Excel, and it looks as if the algorithm we are exploring (matching the AWS algorithm) tends asymptotically to (seed * (1.5^n) * 4/3) if Random.NextDouble() were constantly to deliver its mean value of 0.5. (I'll try to upload the Excel somewhere.) Does anyone want to dig into the math to capture that more precisely?

Otherwise, perhaps some pragmatic warnings will suffice. For example, for an initial seed of 1 second, an "average" sequence (think Random.NextDouble() returns 0.5) reaches a 30-second-ish delay within 7 iterations and 50 seconds within 8 iterations. Perhaps some tables of numbers of this kind, for a few possible variations of seed and limiting number of retries to 5, 7, 8 is enough. Doubtless there will be some systems for which it may make sense, but from a real-world perspective, I don't see more than 7-8 retries or retries delayed by order-of-60 seconds as majority use cases. (EDIT: Background/batch processes may tolerate these levels of delay-before-retry, but again, after that degree of failure, it may be as well to stick the failing items back in some other queue for retrying later.)

Thoughts?

Finally, @george-polevoy : are you happy now with the behaviour of the AWS algorithm following the corrections you have introduced? (for which many thanks!). Or are there aspects of the 'advanced' algorithm from your original post you think still worth considering? I did dive deeper into the advanced algorithm - came up with a bunch of further questions - but won't lengthen this unless you are also interested in pursuing.

It would be awesome to see how the "fixed" AWS algorithm looks also in your graphs.

Alternatively, @grant-d , @george-polevoy , we could eliminate the upper-end-bunching by modifying the latest implementation like this:

        double ms = MinDelay.TotalMilliseconds;
        for (; i < retryCount; i++)
        {
            double ceiling = Math.Min(MaxDelay.TotalMilliseconds, ms * 3);
            ms = _random.Uniform(MinDelay.TotalMilliseconds, ceiling);

            yield return TimeSpan.FromMilliseconds(ms);
        }

Is this worth modelling (to confirm smoothness) in George's graphing system? Can anyone see disadvantages? One implication is that when ms * 3 > MaxDelay.TotalMilliseconds, there is no longer any in-built increase to the backoff. It speaks to the same problem previously discussed, that users do need to choose MaxDelay significantly proportionally larger than seed. However, that lack of increasing backoff may be preferable to the upper-end bunching.

This revisits my previous comment on implementations of System.Random, to provide promised references and clarify for the sake of accuracy and future readers:

the current .NET core implementation seems to use a static global instance of Random to provide random seeds for [ThreadStatic] instances of Random, which in turn are used to fulfil new Random() called on any thread. This approach was supposed (according to various posts by Jon Skeet and Stephen Toub; will try to find links again) ... supposed to improve randomisation behaviour for new Random() called on any thread.

References: post by Stephen Toub; Jon Skeet 1; Jon Skeet 2.

There are discussions on changing the algorithm for .NET Core/Standard, but no changes seem to have landed yet, afaics.

The latest .NET framework and .NET core implementations do in fact differ, because the .NET Framework 4.7.2 implementation still (!- for backward compatibility presumably) has Environment.TickCount as a seed new Random(), whereas .NET Core has moved on to a chained-RNGs approach. (The discussions on changing the algorithm were around the Knuth algorithm vs others, rather than the seed.)

However, Jon Skeet seems to have rowed back from [the chained-RNG approach] and the latest .NET Framework recommendation is to use a single instance application-wide (for maximum randomness), but with fine-grained locking or similar (because a single instance is not thread-safe for simultaneous access from multiple threads).

The recommendation I linked pertains to .NET Framework not .NET Core. However, in targeting .NET Standard, Polly of course can still (as is well known) be consumed by builds against .NET Framework 4.5 and up. So we should still take account of the .NET Framework recommendation (EDIT as PR #536 does) if for jitter we are targeting maximum randomness.


Broadly, it is essential to make Random thread-safe, and these discussions embrace two different approaches to that which exhibit different trade-offs:

  • a fine-grained-lock on a single static instance of Random
    • emphasizes improved randomisation but introduces fine-grained locking (per acquisition of each random number), which could affect perf in high-concurrency scenarios.
    • recommend for Polly's Jitter use case: improved randomisation is important, to avoid spikes; delay due to lock contention is insignificant in the context of introducing jitter delays and executions which have already failed anyway.
  • a chained-RNGs approach with thread-safety achieved by [ThreadStatic] or ThreadLocal<T>
    • doubts over its randomisation credentials being as strong as the preceding approach, but improved perf in high-concurrency scenarios due to no locking.
    • recommend for Polly's chaos-injection use case; we would want randomised chaos-injection to have absolutely minimal drag on executions which weren't randomly selected for chaos injection, so avoidance of locking is preferred; that consideration is more important than how randomly the chaos-impacted executions are selected.

@reisenberger Algorithm by AWS also has a flaw. On my graphs it looks smoother, but still has spikes, because it's not truly decorrelated.

@grant-d The problem with clamping using Min Max is that it ruins randomness, flooding the distribution with some constants. As for the API guarantees, you can use math to proof the MAX and MIN. For ex. max recovery interval for retry is a sum of geometric progression, not matter if you randomize inside, if only limited range random variable does not participate in any asymptotic math function such as 1/x.

My graphing framework turned to be very useful to analyze the problem. I will share it as I promised as I clean up the code.

Currently busy preparing for my DotNext talk this week.
I will follow up next week. I've put substantial effort into exploring this, and I will make sure we use strong math and also we will have good visual proof of what's happening in our formulas.

Again, big thanks both @george-polevoy , @grant-d for everything you are contributing on this. Looking forward @george-polevoy to seeing more.

Very nice, I like how clear the AWS clamping problem is (glad we fixed that).
Is your aim with the other algorithms to include them in the library? If so I have no problem with that, I think softexp is particularly interesting.

I think "soft exp, soft delay X" family looks best to me. It keeps exponential property, has reasonably low peak, low at start, and a hard limit of $2^(n+1) + 1$
But most importantly, it reduces the probability of sharp peaks.

I have another framework which can be used to analyse concurrency issues. I will try to run a simulation with different kinds of retry strategies to see how they actually affect concurrency in fair-scheduling scenarios.

Stochastic algorithms approach is different from deterministic reasoning, no need to clear the interval from 0 to the first retry completely, because you don't know the exact time of start of the transient failure condition anyway, why not to try during that interval, if it helps distribute retries evenly?

Stochastic algorithms approach

Sorry George, I don't understand what you are getting at in the last paragraph, would you mind explaining again?
[Edit]: OK, I think I understand. You're saying no need for 'MinimumDelay' in this scheme, rather just retry somewhat immediately?

Huge thanks @george-polevoy . Now digging in to this. (Bear with me if it takes a few days to respond; everything I do for Polly happens separately from paid-work ...)

@grant-d Yes, I mean immediately.

Instead of minimum delay, there could be half decay period parameter, which is more meaningful for the exponential function. This would be a point in time when probability drops to 1/2.
Exponential property guarantees that the amount of requests going before the half decay would equal to the amount of request coming after (green and purple on the graph have the same area).

https://www.desmos.com/calculator/qmqaapanvc

image

@george-polevoy I've run out time to dive into these in depth this week, due to the number of different strands going through on Polly, but wanting to get to this soon. Thanks!

@george-polevoy Again, thank you. I now digested the algorithm math and simulation basis, very nice. Nice use of the natural shape of tanh to smooth just the initial part of 'soft exp soft delay' ;-}

See the comment on @grant-d 's PR: we could pull some of the new algorithms here into a separate Polly.Contrib package? (if you are willing), together with what @grant-d has put together for other sleep duration strategies.

A remaining thought: it could be useful to understand for the new algorithms where the 50th percentile (say) falls - or see the shape of the graph as a whole - for try 1, try 2, try 3, try 4, separately ... to see how those 50th percentiles (/or the distribution) progresses, with increasing tries.

For example, looking at the graph (and formula) for 'exp jittered (Shifted)', it seems clear the 50th percentiles likely fall with the apex of each peak. But for 'soft exp, soft delay', we probably need to mine some data (/see a graph-per-try) to get a clear idea of how the 50th percentiles fall and progress.

I've worked out how to amend your simulation code to generate this, so I can do, but if it's quick for you/you feel like pulling that out, say for the 'soft exp, soft delay 3' case, that would be useful to see.

Why? Mostly we have focused on how to make the distribution smooth, to fair-schedule retries to reduce contention, perhaps with the background assumption that the underlying system is basically healthy. But I want to keep also in focus the dimension that the retries should happen with increasing delay, for the case of a system which is unhealthy/failing/struggling under load. For unhealthy systems, it can be important that later retries significantly back off - to reduce the potential for retries creating a self-DDOS attack. (Sure, the problem is stochastic, but broadly speaking, by the time fourth retries (say) start kicking in, we will be quadrupling the load.) (And of course, there are other techniques to avoid this, like circuit-breaking on too many failures, but as a gen principle it's helpful for retries to have increasing back off.) So, I would just like to dig into and be sure we understand this aspect - the degree of increasing backoff emerging from the new algorithms - before we go to the Polly public with a recommendation.

Thanks again, @grant-d , @george-polevoy , for everything we are doing on this.

@george-polevoy @grant-d I adapted George's simulation code (for which huge thanks) to graph up the retry interval distribution for individual tries separately. I am happy that the "soft exp, soft delay" function does create clearly increasing backoff as try number progresses. For example, for "soft exp, soft delay 3".

Soft exp, soft delay 3: Distribution of retry intervals for try 0

image

Soft exp, soft delay 3: Distribution of retry intervals for try 1

image

Soft exp, soft delay 3: Distribution of retry intervals for try 2

image

Soft exp, soft delay 3: Distribution of retry intervals for try 3

image


EDIT: Code used as the basis for these graphs, posted here

EDIT 2: The same graphs for "decorrelated jitter (AWS)" clearly show the spikes we expected (particularly for tries 0, 1, 2), and also show (as expected) that the "decorrelated jitter (AWS)" reaches back to make some second and third retries after near-zero delay (taking a chance on the system having recovered, if you like).

Thanks again @grant-d and @george-polevoy for the great investigation into jitter strategy.

Polly v7 has now launched, and we have a Polly-Contrib. I propose - if you are in agreement - we take @grant-d 's PR #536 and your awesome jitter work here @george-polevoy, and make a Polly.Contrib repo Polly.Contrib.WaitAndRetryStrategies to publish it. The repo would give you:

  • full 'write' rights to manage the repo, make/merge PRs etc
  • isolated box (not mixed with anyone else's contrib; only you or people we invite have rights)
  • a template with solution/projects correctly set up to reference latest Polly and build targets
  • and a full build ready to publish your nuget package which will correctly reference Polly.

(ie you can basically just rename from the template, place your code in, and good to go!)

If this sounds good, I can clone to make you a Polly.Contrib.WaitAndRetryStrategies.

(Polly.Contrib also allows me to "get out of the way" and allow contributors such as yourselves to publish these contributions without being blocked on feedback or discussion about exact shape of the API - although I'm still v much there to support if needed. And evolve, eg @george-polevoy , if you want to add something later with half-life/half-decay parameter.)

Also, if we go the contrib route, I would still want to feature this as strongly out from the main Polly, ie to replace/update the wiki jitter information and reach out to your contrib from there and from the readme.

Let me know what you think.


@george-polevoy Also, I did some work with Cesar de la Torre around resilience in the Microsoft Microservices sample app couple of years ago, some of it ended up in this ebook, parts of which are also online here. Assuming we publish your jitter strategy to Polly.Contrib.WaitAndRetryStrategies, we can also suggest a PR to update that Microsoft jitter documentation to reference your better strategy.

That sounds like a great idea, thanks for thinking through it in such detail. The only request I'd make is to that the name is a mouthful, I wonder if there is something more concise. Polly.Contrib.WaitAndRetry or .Backoff or the like

Appreciate your humbleness but your involvement is critical to he success of contrib. My submission is much improved after your input. So you 'getting out the way' is not on my priority list.

No worries @grant-d. We've wanted to get the Polly.Contrib in place for a while, so that users can build additions without the Polly team having to sign off every decision, it gives everyone (me too) a bit more flexibility. The retry strategies are quite a small case (could go either way), but partic the option to keep refining the jitter math point to Contrib.

I am good with Polly.Contrib.WaitAndRetry or open other suggestions.

To know: The choice becomes the namespace ie namespace Polly.Contrib.WaitAndRetry and thus using Polly.Contrib.WaitAndRetry

I've created the Polly.Contrib.WaitAndRetry repo and am sending you both (@grant-d @george-polevoy) invites to the Polly-Contrib organization.

@george-polevoy I'd suggest we use your 'soft exp, soft delay X' algorithm as the best jitter candidate so far (unless you think different). I've invited you to the repo (ie you get write-access) so that you can continue to make changes/suggestions to refine the jitter algorithm further if you want; it's been awesome to have all your great input so far.

@grant-d Are you good to pull the conclusions we came to on this/and other thread into the new repo? You can pull down a zip from Polly.Contrib.BlankTemplate, and just rename/add to form the new Polly.Contrib.WaitAndRetry, then upload back up to Polly.Contrib.WaitAndRetry. After you accept the Polly-Contrib invite you have write rights on the repo.

Let me know when there's a nuget publish candidate and we can figure out the publish step (publish under a Contrib org or individually)

Thanks hugely to you both for the awesome contribution!

Let me know if any questions/anything I can help with.

Initial commit here : Polly-Contrib/Polly.Contrib.WaitAndRetry#2

I just saw your note above about the template - I will retrofit it suitably.

@george-polevoy Are you happy if we include your 'soft exp, soft delay X' algorithm in the Polly.Contrib.WaitAndRetry repo? (We can do all the necessary work to move the code there.)

PR made to pull the new jitter formula into Polly.Contrib.WaitAndRetry : Polly-Contrib/Polly.Contrib.WaitAndRetry#11

Closing. The new jitter formula is now published to nuget as part of Polly.Contrib.WaitAndRetry, and documented in the readme of that project with references back to here.

Thanks to all @george-polevoy @grant-d for raising the issue and clarifying the math and logic so strongly.

Polly wiki on jitter similarly updated.