SpinWait drastically slows continuous ID generation rate for small tick length
kimsey0 opened this issue · 10 comments
When using a small tick length together with `SequenceOverflowStrategy.SpinWait`, the continuous ID generation rate slows down much more than expected once you overflow the sequence number and start spinwaiting.
As an example, the default options use 12 sequence bits, allowing 4096 IDs to be generated during a 1 millisecond tick. On a fast enough machine that can actually generate these IDs in less than 1 millisecond, you'd then expect to be able to generate IDs continuously at a rate of 4096 per millisecond, spinwaiting until the start of the next tick after each overflow. However, in reality, I can only generate roughly 320 IDs per millisecond continuously.
The issue seems to be that `System.Threading.SpinWait` doesn't actually keep spinning, but instead initiates a context switch after a short while. It may then take 10 to 15 milliseconds before the thread is scheduled again, thereby skipping a lot of ticks. I have tested that the time required to generate N > 4096 IDs roughly corresponds with the time required to `SpinWait` for `(N - 1) / 4096` milliseconds. See the benchmark below.
A solution could be to switch from `System.Threading.SpinWait` to an implementation that doesn't initiate context switches. See the example code below. This of course doesn't prevent the operating system from preempting the thread and context switching anyway, but it does seem to work in my benchmark.
A more durable solution is probably to switch to longer ticks, perhaps 10 ms, where the context switching matters less, but that choice would of course be up to consumers. (Changing the default would be breaking.) But maybe the IdGen documentation could at least describe the performance penalty of short ticks combined with spinwaiting.
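For illustration, the kind of non-yielding wait proposed above could look roughly like the sketch below. The `ITimeSource.GetTicks()` call is an assumption about how IdGen reads the current tick; treat this as a sketch of the idea, not a drop-in patch for IdGen's internals.

```csharp
using IdGen;

// Sketch only: a busy-wait that never yields the thread, assuming the current tick
// is read via ITimeSource.GetTicks() (an assumption about IdGen's internals).
static void SpinUntilNextTick(ITimeSource timeSource, long currentTick)
{
    // Deliberately burn CPU instead of calling SpinWait.SpinUntil, so the thread is
    // never descheduled while waiting out the remainder of the tick.
    while (timeSource.GetTicks() <= currentTick)
    {
        // intentionally empty
    }
}
```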
Benchmark code
```csharp
using System.Diagnostics;
using System.Threading;
using BenchmarkDotNet.Attributes;
using IdGen;

public class IdGeneratorBenchmarks
{
    private IdGenerator _generator;

    [Params(4096, 4097, 40960)]
    public int N;

    [IterationSetup(Target = nameof(IdGeneratorCreateId))]
    public void IdGeneratorSetup()
    {
        // This ensures the previous operation hasn't used up the available sequence numbers for the next one.
        _generator = new(0, new IdGeneratorOptions(sequenceOverflowStrategy: SequenceOverflowStrategy.SpinWait));
    }

    [Benchmark]
    public void IdGeneratorCreateId()
    {
        for (int i = 0; i < N; i++)
        {
            _generator.CreateId();
        }
    }

    [Benchmark]
    public void SystemThreadingSpinWait()
    {
        var millisecondsWaitNeeded = (N - 1) / 4096;
        var last = 0L;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < millisecondsWaitNeeded; i++)
        {
            SpinWait.SpinUntil(() => stopwatch.ElapsedMilliseconds > last);
            last = stopwatch.ElapsedMilliseconds;
        }
    }

    [Benchmark]
    public void ManualSpinWait()
    {
        var millisecondsWaitNeeded = (N - 1) / 4096;
        var last = 0L;
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < millisecondsWaitNeeded; i++)
        {
            while (stopwatch.ElapsedMilliseconds <= last) ;
            last = stopwatch.ElapsedMilliseconds;
        }
    }
}
```
Benchmark results
| Method | Job | InvocationCount | UnrollFactor | N | Mean | Error | StdDev | Median |
|------------------------ |----------- |---------------- |------------- |------ |------------------:|-----------------:|------------------:|------------------:|
| SystemThreadingSpinWait | DefaultJob | Default | 16 | 4096 | 30.01 ns | 1.225 ns | 3.434 ns | 29.24 ns |
| ManualSpinWait | DefaultJob | Default | 16 | 4096 | 30.62 ns | 0.645 ns | 1.227 ns | 30.54 ns |
| IdGeneratorCreateId | Job-TRHHUL | 1 | 1 | 4096 | 298,785.71 ns | 1,712.429 ns | 1,518.024 ns | 298,200.00 ns |
| SystemThreadingSpinWait | DefaultJob | Default | 16 | 4097 | 15,446,881.15 ns | 67,378.315 ns | 63,025.718 ns | 15,455,123.44 ns |
| ManualSpinWait | DefaultJob | Default | 16 | 4097 | 1,000,269.71 ns | 108.778 ns | 96.429 ns | 1,000,239.94 ns |
| IdGeneratorCreateId | Job-TRHHUL | 1 | 1 | 4097 | 10,610,806.00 ns | 1,889,663.490 ns | 5,571,715.435 ns | 13,349,950.00 ns |
| SystemThreadingSpinWait | DefaultJob | Default | 16 | 40960 | 136,070,392.11 ns | 2,665,274.841 ns | 2,962,445.923 ns | 136,095,775.00 ns |
| ManualSpinWait | DefaultJob | Default | 16 | 40960 | 9,000,772.81 ns | 141.551 ns | 132.407 ns | 9,000,756.25 ns |
| IdGeneratorCreateId | Job-TRHHUL | 1 | 1 | 40960 | 128,078,597.00 ns | 4,144,345.151 ns | 12,219,695.183 ns | 135,176,300.00 ns |
I am planning for an `IOverflowHandler` interface and a few 'built-in' overflow handlers, like a `SpinwaitOverflowHandler` and `ThrowOverflowHandler` and maybe another one or two. This would allow users to implement their own overflow handler which could, I dunno, raise an event or write to a log or whatever.
However, this is not something that'll make it into this version; I expect this in v-next at the very earliest.
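For illustration only, such an extension point might look something like the following; the interface and handler names come from the comment above, but the method shape is pure speculation, not the planned API.

```csharp
using System;

// Speculative sketch only; the real v-next API may look entirely different.
public interface IOverflowHandler
{
    // Called by the generator when the sequence for the current tick is exhausted.
    void HandleOverflow();
}

public class SpinwaitOverflowHandler : IOverflowHandler
{
    // Would wait until the next tick starts (implementation omitted in this sketch).
    public void HandleOverflow() { }
}

public class ThrowOverflowHandler : IOverflowHandler
{
    public void HandleOverflow() =>
        throw new InvalidOperationException("Sequence overflow");
}
```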
Thanks. An `IOverflowHandler` sounds like a good way to allow overriding the overflow behavior.
Would it make sense to change the current behavior (and the behavior of a future `SpinwaitOverflowHandler`) to manually spin for short tick durations (at least for <= 1 ms, so it works for the default options) to avoid the performance pitfall described above?
That would be an option, yes. But I also think you're attacking this problem from the wrong angle; this overflow shouldn't happen in the first place - and if it does, it should be rare. It sounds to me like you're experiencing this problem on the regular, which means you may have chosen the wrong Id structure with too few sequence bits.
We haven't rolled out IdGen yet, but are in the process of implementing it. We are not expecting to have a lot of overflows, but would prefer not to have 15 ms hiccups if they do occur.
If you think manually spinwaiting would be a good drop-in change for shorter tick durations, I'm happy to submit a pull request. I'd also be happy to submit one that adds a line to the XML documentation of `SequenceOverflowStrategy.SpinWait` describing the current behavior instead.
The whole point behind IdGen is that it's intended to be used in a distributed scenario. If you run into overflows, that is the problem to address. You should:
- Add another generator. That may be in a new, separate thread or process, or even a process on another host (the 'distributed' part), and spread the load.
- Reserve more bits for the sequence (remember: every bit you add doubles the size of the sequence, at the cost of either the host or timestamp bits); see the sketch after this comment.
- And only THEN start accounting for overflows.
I strongly feel like you're taking the wrong approach here. Have you considered the first two points and, if so, could you elaborate on why those aren't an option for you?
That said, the documentation could use a little brush-up on that part. I'll add in a remark on the SpinWait option.
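As a concrete illustration of the second point above (reserving more sequence bits), an IdGen setup could be configured along these lines. The 41/6/16 split is just an example, and the API usage follows the IdGen README as a sketch; verify it against the version in use.

```csharp
using System;
using IdGen;

// Illustration of the 'reserve more sequence bits' advice: a 63-bit layout with
// 41 timestamp bits, 6 generator bits and 16 sequence bits (65,536 IDs per tick).
// The exact split is an example, not a recommendation.
var structure = new IdStructure(41, 6, 16);
var options = new IdGeneratorOptions(structure,
    sequenceOverflowStrategy: SequenceOverflowStrategy.SpinWait);
var generator = new IdGenerator(0, options);

Console.WriteLine(generator.CreateId());
```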
We are of course intending to use IdGen in a distributed system and are carefully allocating bits to meet our needs for longevity (timestamp bits, time increment, and epoch), horizontal scalability (generator bits), and vertical scalability (sequence bits). As I mentioned, we're not deliberately aiming to have overflows, but we think it would be optimal that if an overflow happens, IdGen waited until the next tick instead of waiting so long it skips 15 ticks. I also understand that you don't consider it a priority, since it's an edge case that shouldn't occur for most consumers.
I'm happy that you'll add a remark for the SpinWait option. Thanks! 😄
> As I mentioned, we're not deliberately aiming to have overflows, but we think it would be optimal that if an overflow happens, IdGen waited until the next tick instead of waiting so long it skips 15 ticks.
Again, an overflow should happen very, very rarely, if ever at all. I don't see why this is such a big deal and why you're trying to handle it with busywaiting. If this ever happens, your problem is elsewhere.
I added the spinwait a while back very reluctantly; personally I'd rather see it throw so I can catch the exception and notify / log and maybe even spin up extra instances or whatever. When your system is in a state where it can't keep up, the last thing you want to do is make it busywait, wasting CPU cycles where it could be doing actual work and exacerbating the situation even further.
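A minimal sketch of that throw-and-notify pattern follows; it assumes the Throw strategy surfaces as IdGen's `SequenceOverflowException`, which should be verified against the IdGen version in use.

```csharp
using System;
using IdGen;

// Sketch of the 'rather throw and notify/log' approach; SequenceOverflowException is
// assumed to be the exception thrown by the Throw strategy - verify for your version.
static long CreateIdOrReport(IdGenerator generator)
{
    try
    {
        return generator.CreateId();
    }
    catch (SequenceOverflowException ex)
    {
        // Log, raise an alert or trigger scale-out here instead of busywaiting.
        Console.Error.WriteLine($"Sequence overflow: {ex.Message}");
        throw;
    }
}
```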
Also, did you notice we have a `TryCreateId()` method? You could easily call that in a loop:
```csharp
var id = 0L;
while (!generator.TryCreateId(out id));
Console.WriteLine(id);
```
This is, essentially, a busywait as well (there's a little nuance though).
That said, yes, the current implementation isn't optimal. Yes, there is room for improvement. And I will implement that soon - but at the earliest with v-next.
I am curious now, though: could you tell me what IdGen is used for in your project? How many hosts/generators are we talking about? How many IDs are you generating at peak moments? What is the absolute minimum timestamp resolution you need? Assuming you're using the default IdStructure (41/10/12), you could have 2^10 = 1024 generators each generating 2^12 = 4096 IDs per millisecond. That's 4,194,304,000 IDs per second! Allow for some headroom and use just 10% of that capacity, and you can still generate 419,430,400 IDs per second without breaking a sweat. That's 419 million IDs per second.
Let's make it even worse: 10% of the generators (call it a round 100) with 10% of the sequence (call it 400) still lets you generate 40,000 IDs per millisecond, or 40,000,000 IDs per second. Who needs that many - realistically?
> I am curious now, though: could you tell me what IdGen is used for in your project? How many hosts/generators are we talking about? How many IDs are you generating at peak moments? What is the absolute minimum timestamp resolution you need?
Sure! We're planning to use Snowflake IDs for entities created by a range of heterogeneous micro-services running in Kubernetes. We'd like to use the host part of the pod IP address, allocated in a suitably sized subnet, as part of the generator ID for simplicity. We want to be able to scale to 8192 pods per cluster, and to have an extra 2 bits for spinning up and moving traffic to new versions of the cluster, requiring a total of 15 bits for the generator ID.
We would like 30+ years before we run out of IDs, meaning we need 40 bits for timestamps with a 1 ms resolution, 37 bits with a 10 ms resolution, 34 bits with a 100 ms resolution, etc. That leaves 8, 11, or 14 bits for the sequence number, allowing us to generate 256 IDs per 1 ms, 2048 per 10 ms, or 16,384 per 100 ms, respectively.
Different services have wildly different patterns for ID generation load. None have sustained loads that require generating more than, say, 256,000 IDs per second in one pod, but we have seen rare peaks where one pod generated 1000 IDs within a few milliseconds. This is where busywaiting, though wasteful, may be acceptable for us.
We don't have a strict minimum timestamp resolution, so that's the part we'll adjust to avoid sequence overflows from short load spikes while keeping IDs from among services as well-ordered as possible.
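For concreteness, the 100 ms variant described above might be configured roughly as follows (bit split 34/15/14). The epoch is a placeholder, and the `DefaultTimeSource` tick-duration overload is an assumption; treat the exact API usage as a sketch and verify it against the IdGen version in use.

```csharp
using System;
using IdGen;

// Sketch of the 100 ms variant described above: 34 timestamp bits, 15 generator bits,
// 14 sequence bits (16,384 IDs per 100 ms tick). Placeholder epoch; the tick-duration
// overload of DefaultTimeSource is assumed - verify against your IdGen version.
var epoch = new DateTimeOffset(2024, 1, 1, 0, 0, 0, TimeSpan.Zero);
var structure = new IdStructure(34, 15, 14);
var timeSource = new DefaultTimeSource(epoch, TimeSpan.FromMilliseconds(100));
var options = new IdGeneratorOptions(structure, timeSource, SequenceOverflowStrategy.SpinWait);

// The generator ID (0 here) would be derived from the pod IP's host part in the setup above.
var generator = new IdGenerator(0, options);
Console.WriteLine(generator.CreateId());
```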
Have you considered the `TryCreateId(...)` approach I suggested? That may be the easiest way to still get a performant busywait.
Also, it sounds like all entities get a unique id in your setup, but you can of course consider each entity type (say, `Customer`, `Order`, and `Product`) as a 'namespace' (depending on your requirements, that is). You can perfectly well use another generator instance with the same generator ID as long as you are careful to use the correct generator for the correct entity; that should be easy to 'fix' if you implement a `GeneratorProvider` class that you can ask for a specific generator using generics (like `GetGenerator<Customer>()`) and that keeps track of the generators. Usually there's no problem in a customer with id 7 and a product with id 7; the ids don't "collide".
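A rough sketch of what such a provider could look like; `GeneratorProvider` and `GetGenerator<T>()` are hypothetical names taken from the comment above, not part of IdGen.

```csharp
using System;
using System.Collections.Concurrent;
using IdGen;

// Rough sketch of the per-entity 'namespace' idea: one IdGenerator instance per entity
// type, handed out by type. The class and method names are hypothetical.
public class GeneratorProvider
{
    private readonly ConcurrentDictionary<Type, IdGenerator> _generators = new();
    private readonly int _generatorId;

    public GeneratorProvider(int generatorId) => _generatorId = generatorId;

    public IdGenerator GetGenerator<TEntity>() =>
        _generators.GetOrAdd(typeof(TEntity), _ =>
            new IdGenerator(_generatorId,
                new IdGeneratorOptions(sequenceOverflowStrategy: SequenceOverflowStrategy.SpinWait)));
}

// Usage: IDs for different entity types come from different generator instances, so a
// customer with id 7 and a product with id 7 don't 'collide'.
// var provider = new GeneratorProvider(generatorId: 0);
// var customerId = provider.GetGenerator<Customer>().CreateId();
```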
> Have you considered the `TryCreateId(...)` approach #53 (comment)? That may be the easiest way to still get a performant busywait.
We will consider it. 🙂
> Also, it sounds like all entities get a unique id in your setup, but you can of course consider each entity type (say, `Customer`, `Order`, and `Product`) as a 'namespace' (depending on your requirements, that is).
Yes, that's a possibility. In the cases where we see spikes in ID generation for one pod, it's often just one or two entity types, though, so it may not make much of a difference there, and we will have to weigh whether it's worth the added complexity. Having unique IDs across entity types also simplifies, for example, searching in log files.