Performance in browsers
rubensworks opened this issue · 41 comments
In comunica/comunica#561, we identified AsyncIterator as (one of) the main cause of slowdown in query performance.
While AsyncIterator is very fast in Node.js, it is orders of magnitude slower in browsers such as Chrome.
The cause of this seems to be the extensive usage of setImmediate in AsyncIterator, which is fast in Node.js, but not in browsers (due to shims).
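For context, browser bundles fall back to a shim; a simplified sketch of the naive variant is below (real shims such as the setimmediate package typically use postMessage or a MessageChannel, but every fallback adds per-call overhead, and a plain setTimeout is additionally subject to timer clamping):
if (typeof setImmediate === 'undefined') {
  // Naive browser fallback: defer via a timer instead of Node's native check queue.
  window.setImmediate = function (fn, ...args) {
    return setTimeout(fn, 0, ...args); // subject to minimum-delay clamping
  };
  window.clearImmediate = clearTimeout;
}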
I've started a branch on my fork to do some profiling.
This is the current output:
Node
elements,transformers,time
1000,1, 8.535ms
1000,10, 26.684ms
1000,100, 42.072ms
10000,1, 14.416ms
10000,10, 21.166ms
10000,100, 105.672ms
100000,1, 115.589ms
100000,10, 199.156ms
100000,100, 981.854ms
Browser
elements,transformers,time
1000,1, 60.205810546875ms
1000,10, 280.35693359375ms
1000,100, 2386.37890625ms
10000,1, 365.549072265625ms
10000,10, 2032.72705078125ms
10000,100, 26157.5810546875ms
100000,1, 3524.163818359375ms
100000,10, 27374.552978515625ms
100000,100, 213922.16796875ms
These results show that both the stream size and the number of transformers have a significant impact in browsers.
I'll play around a bit to see what optimizations are possible.
setImmediate(emit, this, 'readable') appeared to be the main problem here.
I did some minor adjustments to squash nested setImmediate calls into one large setImmediate call in rubensworks@841f206.
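I'm not describing the exact commit, but the general idea can be sketched with a hypothetical deferEmit helper: once we are already inside a deferred callback, further emits happen synchronously instead of scheduling yet another setImmediate.
const { EventEmitter } = require('events');

let insideDeferred = false;

function deferEmit(emitter, eventName) {
  if (insideDeferred)
    return void emitter.emit(eventName); // already inside a deferred tick: emit synchronously
  setImmediate(() => {
    insideDeferred = true;
    try { emitter.emit(eventName); }
    finally { insideDeferred = false; }
  });
}

// Example: the nested deferEmit triggered by 'readable' runs in the same
// setImmediate tick instead of scheduling a second one.
const source = new EventEmitter();
source.on('readable', () => deferEmit(source, 'done'));
source.on('done', () => console.log('done'));
deferEmit(source, 'readable');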
@RubenVerborgh Could you have a look at this?
If you think this type of solution is OK, I'll clean things up and test on some larger use cases (because I think this change may break things for some edge cases).
With this change, we get the following performance (almost the same performance in Node and browsers):
Node
elements,transformers,time
1000,1, 6.541ms
1000,10, 32.780ms
1000,100, 30.866ms
10000,1, 8.172ms
10000,10, 18.436ms
10000,100, 144.885ms
100000,1, 24.727ms
100000,10, 140.363ms
100000,100, 1240.560ms
Browser
elements,transformers,time
1000,1, 14.528076171875ms
1000,10, 60.06884765625ms
1000,100, 72.163818359375ms
10000,1, 14.634765625ms
10000,10, 22.64697265625ms
10000,100, 149.65087890625ms
100000,1, 72.8623046875ms
100000,10, 164.00390625ms
100000,100, 1174.532958984375ms
Thanks @rubensworks, will study this in detail.
setImmediate(emit, this, 'readable') appeared to be the main problem here.
Very interesting; do you have numbers to share?
Coincidentally, one of the design considerations in asynciterator was to fix a problem in the old internal iterators within ldf-client, where an explicit state was not kept. As such, in that implementation, the readable event would be fired much more than needed. By keeping an explicit state, we can emit readable only when a relevant state change happened. Now there might have been bugs in that implementation, but it was a consideration.
So there are two questions:
- are all of the readable events correct, or are we emitting more than strictly needed?
- which of the readable events should be emitted async, and which sync? Because we do want to avoid the scenario where a readable listener is attached right after a method call to an AsyncIterator, but never fired because that method emitted readable sync.
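To illustrate the second question, a contrived example with a hypothetical SyncSource (not asynciterator code): if read() were to emit readable synchronously, a listener attached right after the call would simply miss the event.
const { EventEmitter } = require('events');

class SyncSource extends EventEmitter {
  read() {
    // Emitting synchronously: anyone not yet listening misses this event.
    this.emit('readable');
    return null;
  }
}

const source = new SyncSource();
source.read();                                      // fires before the listener exists
source.on('readable', () => console.log('never printed'));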
Very interesting; do you have numbers to share?
I didn't count, but it looked like almost every iterated element resulted in a readable event, which then caused a buffer refill call, all behind setImmediate calls.
My commit does not change the number of events though, so there doesn't seem to be a performance issue with them.
are all of the readable events correct, or are we emitting more than strictly needed?
I haven't looked at it in detail, but I think that multiple readable events may be emitted for the same data element (not 100% certain about this yet though).
In any case, I don't think that the number of events is the problem, but instead the fact that they were emitted behind a setImmediate.
which of the readable events should be emitted async, and which sync?
I also had this concern. I think we may want to have two types of readable events:
- an async one that can be used by external consumers (and optionally internal as well).
- a sync one that can only be used internally for performance reasons (as currently implemented in my commit above).
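A rough sketch of what I mean (with a hypothetical _onReadable hook; not a concrete API proposal): internal consumers get a synchronous callback, while external listeners keep the asynchronous event.
function signalReadable(iterator) {
  // Sync fast path for internal consumers (e.g., a downstream transformer).
  if (typeof iterator._onReadable === 'function')
    iterator._onReadable();
  // Async path for external listeners, preserving the current event contract.
  setImmediate(() => iterator.emit('readable'));
}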
FYI, if you want to run the performance tests in my fork, you can execute npm run perf and npm run perf-browser.
I've added some measurements on BufferedIterator in 93d0e8f.
The performance difference seems to be still apparent here, especially when the number of stream elements is small:
Node
elements:transformers:time
1000:1: 9.666ms
1000:10: 37.189ms
1000:100: 62.169ms
10000:1: 48.658ms
10000:10: 38.543ms
10000:100: 294.332ms
100000:1: 50.555ms
100000:10: 281.684ms
100000:100: 2628.572ms
Browser
elements:transformers:time
1000:1: 127.3662109375ms
1000:10: 268.007080078125ms
1000:100: 496.833984375ms
10000:1: 260.447998046875ms
10000:10: 307.476806640625ms
10000:100: 763.65625ms
100000:1: 313.001220703125ms
100000:10: 543.89404296875ms
100000:100: 2924.14794921875ms
Interesting. Can we also log number of setImmediate calls?
bd5c111 now logs the number of setImmediate calls (only in Node.js).
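The actual instrumentation in that commit may look different, but the idea is essentially a counting wrapper along these lines (hypothetical countedSetImmediate; Node-only because of the exit hook):
const counts = Object.create(null);

function countedSetImmediate(label, fn, ...args) {
  counts[label] = (counts[label] || 0) + 1;     // tally per call site
  return setImmediate(fn, ...args);
}

// Print the tallies when the benchmark process exits.
process.on('exit', () => console.log(counts));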
Transform
elements:transformers:time(ms):setImmediates
1000:1:6:5
1000:10:45:32
1000:100:38:302
10000:1:11:9
10000:10:26:36
10000:100:140:306
100000:1:30:53
100000:10:154:80
100000:100:1430:350
Buffered
elements:transformers:time(ms):setImmediates
1000:1:11:257
1000:10:52:320
1000:100:76:950
10000:1:40:461
10000:10:41:524
10000:100:382:1154
100000:1:53:455
100000:10:346:518
100000:100:2995:1148
To me, it looks like the number of setImmediate calls definitely influences the times in browsers, especially for the BufferedIterator case. I'll have a look at possible performance improvements now...
For reference, the problem with the BufferedIterator seemed to be caused by the number of end events: 539352a.
With this change, the number of setImmediate calls is very low, and the performance difference between Node and browser is minor:
Node
elements:transformers:time(ms):setImmediates
1000:1:8:4
1000:10:35:22
1000:100:34:202
10000:1:40:8
10000:10:94:26
10000:100:269:206
100000:1:57:52
100000:10:303:70
100000:100:2722:250
Browser
elements:transformers:time(ms):setImmediates
1000:1:56:0
1000:10:75:0
1000:100:68:0
10000:1:52:0
10000:10:63:0
10000:100:339:0
100000:1:118:0
100000:10:494:0
100000:100:2602:0
Very interesting. Maybe we should look into whether ending async is desired at all (we can do semver major).
Perhaps async ending can remain, and we can just refactor this part a bit:
AsyncIterator.prototype._end = function (destroy) {
if (this._changeState(destroy ? DESTROYED : ENDED)) {
this._readable = false;
this.removeAllListeners('readable');
this.removeAllListeners('data');
this.removeAllListeners('end');
}
};
function end(self, destroy) { self._end(destroy); }
function endAsync(self) { setImmediate(end, self); }
So that endAsync will only invoke a setImmediate if the iterator has not ended yet.
Because right now, this function can be called many times after each other, which caused the many setImmediate calls.
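Concretely, something along these lines (just a sketch; it assumes the existing ended check is enough, which may not hold):
function endAsync(self) {
  // Only schedule the end once the iterator is not yet ended.
  if (!self.ended)
    setImmediate(end, self);
}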
if the iterator has not ended yet.
If an end has not been scheduled yet?
can be called many times after each other
Does that happen commonly?
if the iterator has not ended yet.
If an end has not been scheduled yet?
Hmm yes, the refactor may not be so trivial. We may need a new 'ENDING' state for this.
can be called many times after each other
Does that happen commonly?
Yes, 252 times for 1000 elements and 1 transformer, to be exact.
Yes, 252 times for 1000 elements and 1 transformer, to be exact.
Very weird; I wonder if that's a bug.
ENDING state is fine with me (semver major).
Ok, this was surprisingly easy, done in 92539e5.
Did not break any tests, but we may want to consider adding additional tests for this later on.
Performance in the browser remains as good as before.
I just had a look at the failing tests in my branch.
Some tests have been fixed. A few others still fail, but I think those should be rewritten based on the internal changes.
@RubenVerborgh I'll hand this over to you now to see whether or not my changes make sense, and to check if the failing tests should actually fail :-)
Ok, this was surprisingly easy, done in 92539e5.
Good codebase.
@RubenVerborgh I'll hand this over to you now to see whether or not my changes make sense, and to check if the failing tests should actually fail :-)
Will do.
I have created a branch feature/performance that contains all measurement scripts. I'll now evaluate each of the proposed solutions one by one to see which one(s) we need.
In db25c5a, I've identified the contribution of different reasons for calling setImmediate. Number 4 will shock you.
- BufferedIterator init 333
- emit readable 83592
- end async 666
- fillBufferAsync 3080916
Does the above contradict your earlier statement?
setImmediate(emit, this, 'readable') appeared to be the main problem here.
Regarding whether double emissions of the end event happen often, I note for the TransformIterator benchmark that 342 iterators are created and 666 asynchronous ends are requested. It's probably not a coincidence that 666 is twice the number of BufferedIterator inits at 333. I wonder if there is an error in the logic which causes them to be ended twice, or whether it is a deliberate safety measure.
However, for the BufferedIterator benchmark, 342 iterators are created in total, all 342 of them are BufferedIterator instances, and there are 342 endings. So no problem there.
All in all, it does not seem like the double ends are a major contributor to the problem (but avoiding them is of course meaningful).
In any case, rubensworks@92539e5 with the addition of ENDING seems like the right decision. So incorporating that for sure.
@rubensworks A logical conclusion from the above is to buffer synchronously instead of asynchronously.
The idea was to make BufferedIterator#read return as soon as possible, and schedule a refill of the buffer after the method had returned. However, it might actually be better to pay the waiting cost of buffering while reading, and thus avoid the scheduling overhead.
I have tried this idea in cdc5f30, and it yields a major performance improvement. Could you try this out in complex scenarios (such as Comunica) and tell me if you have any test failures? It does make some tests fail in AsyncIterator, but that might be due to the tests themselves.
A problem with the fix is that the buffer will now refill every time one element is taken out of it. In ba437cf, I'm triggering a refill only when the buffer is half empty. Better performance, but more test failures (most of which just count the number of reads, which are understandably affected).
Finally, 85122af only fills the buffer when empty.
The last two commits should not cause more failures than the first (let me know if they do). However, performance might be completely different.
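The gist of the three variants, stripped down to a toy class (hypothetical SyncBuffered, not the actual BufferedIterator code): the refill happens synchronously inside read(), and the threshold decides how eagerly.
class SyncBuffered {
  constructor(fill, maxBufferSize = 4) {
    this._fill = fill;                  // fill(buffer, n): push up to n items
    this._buffer = [];
    this._maxBufferSize = maxBufferSize;
  }
  read() {
    // Refill synchronously once the buffer is half empty (the ba437cf variant);
    // cdc5f30 refills on every read, 85122af only when the buffer is empty.
    if (this._buffer.length <= this._maxBufferSize / 2)
      this._fill(this._buffer, this._maxBufferSize - this._buffer.length);
    return this._buffer.length ? this._buffer.shift() : null;
  }
}

// Example: a source producing increasing integers.
let next = 0;
const it = new SyncBuffered((buf, n) => { while (n--) buf.push(next++); });
it.read(); // 0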
Number 4 will shock you.
Great clickbait.
Does the above contradict your earlier statement?
setImmediate(emit, this, 'readable') appeared to be the main problem here.
Yes, that was my initial thought, which appeared to be incomplete.
The last two commits should not cause more failures than the first (let me know if they do). However, performance might be completely different.
What is the performance diff between Node and browser in your branch at the moment?
Are the current changes sufficient to make the overhead in browsers insignificant?
Through this commit, I also noticed a significant performance improvement. Of course, these changes are quite invasive, so we'd probably want to avoid them if not needed.
Could you try this out in complex scenarios (such as Comunica) and tell me if you have any test failures?
Ok, will make time for this later this week.
What is the performance diff between Node and browser in your branch at the moment?
Performance before (NT/NB/BT/BB = Node/Browser × Transform/Buffer; columns: elements:transformers:time(ms):setImmediates)
NT: 100000:100:821:2525502
NB: 100000:100:1404:2550303
BT: 100000:10:13797:0
BB: 100000:10:17141:0
Performance after (same format)
NT: 100000:100:434:502
NB: 100000:100:791:404
BT: 100000:100:565:0
BB: 100000:100:978:0
So not only does it make the browser faster, it makes Node faster too.
But we should ensure these findings have sufficient external validity, i.e. that the improvement is not just on those two benchmarks.
Are the current changes sufficient to make the overhead in browsers insignificant?
Getting close, but there's apparently still some more we want to squeeze out. Not sure if puppeteer causes overhead, might want to check that too before drawing conclusions. Perhaps some other async calls are not needed either.
Through this commit, I also noticed a significant performance improvement. Of course, these changes are quite invasive, so we'd probably want to avoid them if not needed.
Indeed. My idea started by looking at that commit, and thinking if we could make the bookkeeping easier. Then I wondered if we could instead remove the cause of that bookkeeping.
Ok, I did some testing in Comunica.
The good news: all tests still pass, so there seem to be no breakages.
The bad news: performance in browsers has not been improved that much.
I executed the following query in Node and Chrome.
After the change, there is a slight performance improvement for both, but browser is still 10x slower.
Query:
SELECT ?movie ?title ?name
WHERE {
?movie dbpedia-owl:starring [ rdfs:label "Brad Pitt"@en ];
rdfs:label ?title;
dbpedia-owl:director [ rdfs:label ?name ].
FILTER LANGMATCHES(LANG(?title), "EN")
FILTER LANGMATCHES(LANG(?name), "EN")
}
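For reference, a timing harness along these lines could drive such a measurement (a sketch only: the Comunica API calls and the TPF source shown here are assumptions, not the exact setup used):
const { newEngine } = require('@comunica/actor-init-sparql');

async function timeQuery(queryString) {
  const engine = newEngine();
  const start = Date.now();
  const result = await engine.query(queryString, {
    sources: [{ type: 'hypermedia', value: 'http://fragments.dbpedia.org/2016-04/en' }],
  });
  return new Promise((resolve, reject) => {
    result.bindingsStream
      .on('data', () => { /* consume bindings */ })
      .on('error', reject)
      .on('end', () => resolve(Date.now() - start));
  });
}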
Before:
Iterations | Node | Browser |
---|---|---|
1 | 759.711ms | 4754.178ms |
10 | 3437.605ms | 34451.179ms |
After:
Iterations | Node | Browser |
---|---|---|
1 | 739.698ms | 4224.340ms |
10 | 2371.714ms | 31744.079ms |
How shall we go forward with this? Additional profiling in Comunica and perhaps more extensive performance test cases in AsyncIterator?
(FYI, I also tested with my branch, but the tests did break there, so my changes were a bit too extensive...)
How shall we go forward with this?
Could you run logTrackedEvents to understand why? Perhaps there are many more end events here.
Can I easily run this test above, so I can add more logging and insights?
logTrackedEvents with 1 iteration:
- fillBufferAsync: _fillBuffer 3058
- fillBufferAsync: autoStart 931
- setImmediate: BufferedIterator init 2287
- setImmediate: emit data 495
- setImmediate: emit end 3
- setImmediate: emit property 123
- setImmediate: emit readable 10628
- setImmediate: end async 3274
- setImmediate: fillBufferAsync 3989
- setImmediate: property callback 121
- setImmediate: readAndTransform 5142
- setImmediate: readAndTransformSimple 43
logTrackedEvents with 10 iterations:
- fillBufferAsync: _fillBuffer 26198
- fillBufferAsync: autoStart 7123
- setImmediate: BufferedIterator init 19594
- setImmediate: emit data 4950
- setImmediate: emit end 30
- setImmediate: emit property 2317
- setImmediate: emit readable 92968
- setImmediate: end async 28384
- setImmediate: fillBufferAsync 33321
- setImmediate: property callback 120
- setImmediate: property callbacks 1
- setImmediate: readAndTransform 44643
- setImmediate: readAndTransformSimple 430
So the majority appears to be caused by readable events like before, but still quite a lot in the other places as well.
Can I easily run this test above, so I can add more logging and insights?
Nothing as clean as what I've set up for AsyncIterator, unfortunately.
This branch can be used for testing, but it involves some manual steps to get the browser part working (see latest commit in that branch): https://github.com/comunica/comunica/tree/perf/asynciterator
OK, so it seems the test cases are not representative of the scenario we have. (Nonetheless, I think they are good cases and would keep the outcomes.)
Perhaps it would indeed be good to have such cases; I'd also like to understand what is so specific about the Comunica scenario that we hadn't captured. With some tracing, we should be able to understand why so many readable events are being fired. It's likely that many of them don't need to be async.
With some tracing, we should be able to understand why so many readable events are being fired.
👍
Are you able to look into this? If not, I can try to do it later this week, or next week.
Also, Comunica makes use of asynciterator-promiseproxy.js and asynciterator-union.js. Neither should have a big impact on this, but just so we're aware that these are present.
Are you able to look into this? If not, I can try to do it later this week, or next week.
Would be great if you could set up test cases; happy to help look at possible causes and fixes.
Also, Comunica makes use of asynciterator-promiseproxy.js and asynciterator-union.js. Neither should have a big impact on this, but just so we're aware that these are present.
No setImmediate there. Perhaps what we want to do is shim setImmediate to setTimeout and see if performance in Node and browser is the same, just so we're sure there's no other cause.
Another clue:
The query I ran before requests ~600 pages, and creates at least one new iterator for each page.
When I measure all asynciterators that are created, I get the following types (with their number of instantiations):
3 EmptyIterator
78 MultiTransformIterator
122 TransformIterator
244 MediatedLinkedRdfSourcesAsyncRdfIterator
249 RoundRobinUnionIterator
611 SimpleTransformIterator
621 ArrayIterator
819 ClonedIterator
983 PromiseProxyIterator
(total: 3730)
Based on my last comment, my suspicion is that MultiTransformIterator is at the root of this (or at least the creation of many AsyncIterators is). So I set up some additional measurements in rubensworks@37d0cdf.
Node:
elements:transformers:time(ms):setImmediates
1000:1:47:4009
1000:10:87:40072
1000:100:273:400702
10000:1:65:40009
10000:10:280:400072
10000:100:2056:4000702
100000:1:537:400009
100000:10:2447:4000072
100000:100:21768:40000702
Browser:
elements:transformers:time(ms):setImmediates
1000:1:354:0
1000:10:2916:0
1000:100:34154:0
10000:1:2776:0
10000:10:27313:0
(aborted here, too slow...)
tracked events
- fillBufferAsync: _fillBuffer 12321333
- fillBufferAsync: autoStart 333
- setImmediate: BufferedIterator init 333
- setImmediate: emit readable 24642675
- setImmediate: end async 12321675
- setImmediate: fillBufferAsync 12321666
Of course, this is a very extreme case, so we probably should not try to optimize it in full. But it is an indication that we may still need to do something about the 'readable' events.
Perhaps we can try to buffer these events somehow if many of them are encountered in sequence. But doing this without lowering performance may be tricky...
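One possible shape of such buffering (just a sketch with a hypothetical _readablePending flag): only schedule an asynchronous readable emission when none is pending yet, so a long synchronous burst collapses into a single event.
function emitReadable(iterator) {
  if (iterator._readablePending)
    return;                               // an emission is already scheduled
  iterator._readablePending = true;
  setImmediate(() => {
    iterator._readablePending = false;
    iterator.emit('readable');            // one event for the whole burst
  });
}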
For reference, some critical parts in Comunica that heavily make use of AsyncIterator:
@rubensworks So I was able to cut back the number of setImmediate calls drastically by:
1. Not emitting readable on start (01bb2f2)
2. Ending synchronously (c30f1b1)
3. Not insisting on filling the buffer all the way (aa90d22)
Alterations 1 and 2 change the semantics of events. However, they might also be specific to this benchmark, given that they pertain to IntegerIterator.
Could you check if Alteration 3 by itself is sufficient to improve Comunica performance?
Additionally, at some point, Alteration 4 (7b680e4) also made a difference, but this is no longer the case for the given benchmark. I included it just in case it does make a difference for Comunica.
I was finally able to test these changes in Comunica.
Unfortunately, Alteration 4 seems to have broken things, as I get no results from my query anymore.
So I did the experiments using Alterations 1-3.
Query:
SELECT ?movie ?title ?name
WHERE {
?movie dbpedia-owl:starring [ rdfs:label "Brad Pitt"@en ];
rdfs:label ?title;
dbpedia-owl:director [ rdfs:label ?name ].
FILTER LANGMATCHES(LANG(?title), "EN")
FILTER LANGMATCHES(LANG(?name), "EN")
}
Before:
Iterations | Node | Browser |
---|---|---|
1 | 1636.558ms | 4658ms |
10 | 3058.392ms | 17262ms |
After:
Iterations | Node | Browser |
---|---|---|
1 | 1458.051ms | 4159ms |
10 | 3560.065ms | 15558ms |
It looks like the performance improvements are minimal, so we're not there yet.
I also tracked all event types again, but the results here seem similar to before.
Tracked events for 1 iteration:
- fillBufferAsync: autoStart 1175
- setImmediate: BufferedIterator init 2531
- setImmediate: emit data 495
- setImmediate: emit end 3
- setImmediate: emit property 123
- setImmediate: emit readable 9686
- setImmediate: end async 3472
- setImmediate: fillBufferAsync 1175
- setImmediate: property callback 121
- setImmediate: readAndTransform 5142
- setImmediate: readAndTransformSimple 43
Tracked events for 10 iterations:
- fillBufferAsync: autoStart 9563
- setImmediate: BufferedIterator init 22034
- setImmediate: emit data 4950
- setImmediate: emit end 30
- setImmediate: emit property 2319
- setImmediate: emit readable 87281
- setImmediate: end async 30364
- setImmediate: fillBufferAsync 9563
- setImmediate: property callback 121
- setImmediate: readAndTransform 44643
- setImmediate: readAndTransformSimple 430
At the moment I don't have any ideas on how to go forward with this.
Did you use my additional measurements on MultiTransformIterator (rubensworks@37d0cdf) for your latest alterations?
In any case, we probably have to do more profiling and even more extensive test cases.
Did you use my additional measurements on MultiTransformIterator (rubensworks@37d0cdf) for your latest alterations?
I did indeed.
At the moment I don't have any ideas on how to go forward with this.
The problem is that I am optimizing to the benchmark, but not to Comunica.
And there is no real correlation at the moment.
So the main thing to do for me is to test this in Comunica, preferably the smallest relevant build.
The main thing where you can help is by setting up such an environment, and then I can likely take it from there.
As far as the contents go, I think the counters are our main point of feedback.
The highest count is still readable, and Alteration 4 catches a situation in which the readable status would flip-flop twice in succession without being needed. Alteration 4 introduces some subtle timing changes (semver major), so it might lead to breakage, but that should be fixable.
The next concern in numbers is readAndTransform. We might want to investigate there whether the needed transformation is indeed async. If it is not, we might want to build a dedicated synchronous transform.
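For instance, a dedicated synchronous transform could boil down to something like this (hypothetical SyncMapIterator, shown only to make the idea concrete):
class SyncMapIterator {
  constructor(source, map) {
    this._source = source;   // anything with an AsyncIterator-style read()
    this._map = map;         // synchronous item -> item function
  }
  read() {
    const item = this._source.read();
    return item === null ? null : this._map(item);   // no scheduling involved
  }
}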
Next, there's the question of asynchronous endings. Maybe we don't need those.
And finally, asynchronous initializations. Maybe we don't need those either.
On the positive side, fillBufferAsync has gone down significantly, though not with consistent result improvements if I look at Node (but that might be an outlier).
So the main thing to do for me is to test this in Comunica, preferably the smallest relevant build.
The main thing where you can help is by setting up such an environment, and then I can likely take it from there.
Would the scripts I added in comunica/comunica@4ced255 be usable for you?
It should be possible to do something with puppeteer, but due to the long webpacking times of comunica, it may be better if we keep this as a separate manual process.
we might want to build a dedicated synchronous transform.
Many transformations in Comunica are indeed synchronous, so that may definitely be valuable.
And finally, asynchronous initializations. Maybe we don't need those either.
Not sure what you have in mind for this, but at least setting a source asynchronously should be possible, as we use this quite often in PromiseProxyIterator.
I see some opportunities in using the microtask queue rather than the macrotask queue.
Could you try the following?
// Replace macrotask scheduling with the (already resolved) promise's microtask queue.
const resolved = Promise.resolve(null);
function setImmediate(fn, ...rest) {
  resolved.then(() => fn(...rest));
}
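For what it's worth, the standard queueMicrotask API should give the same behaviour where it is available (an assumption that it exists in the Node and browser versions we target):
function setImmediateViaMicrotask(fn, ...rest) {
  queueMicrotask(() => fn(...rest));   // schedules on the microtask queue
}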
I still have some termination problems with tests, but they might just be test problems due to the changed semantics.
It looks like the performance improvements with this Promise trick are the same as with your alterations.
Before:
Iterations | Node | Browser |
---|---|---|
1 | 1636.558ms | 3707ms |
10 | 3058.392ms | 18728ms |
After:
Iterations | Node | Browser |
---|---|---|
1 | ERROR | 3452ms |
10 | ERROR | 15847ms |
The Node ERROR is caused by some setImmediate assumptions in Comunica, which I expect are fixable on my end.
For posterity:
@rubensworks and I have looked more into the issue; it seems that using a microtask-based solution for call-order scheduling (as opposed to setImmediate) yields better performance in Node, and brings browsers down to a reasonable level. Browsers still seem to have a slower task queue implementation than Node.js, but at least the difference is much more manageable.
Bringing down the number of calls does not seem to have as much of an effect anymore with a faster queue; but we might still need to look at that given appropriate real-world cases. (Because we've also learned here that tremendous gains on some benchmarks are meaningless for some actual cases.)