Performance in browsers
rubensworks opened this issue · 41 comments
In comunica/comunica#561, we identified AsyncIterator as (one of) the main cause of slowdown in query performance.
While AsyncIterator is very fast in Node.js, it is orders of magnitude slower in browsers such as Chrome.
The cause of this seems to be the extensive usage of setImmediate in AsyncIterator, which is fast in Node.js, but not in browsers (due to shims).
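For context, browser bundles fall back to a shim; a simplified sketch of the naive variant is below (real shims such as the setimmediate package typically use postMessage or a MessageChannel, but every fallback adds per-call overhead, and a plain setTimeout is additionally subject to timer clamping):
if (typeof setImmediate === 'undefined') {
  // Naive browser fallback: defer via a timer instead of Node's native check queue.
  window.setImmediate = function (fn, ...args) {
    return setTimeout(fn, 0, ...args); // subject to minimum-delay clamping
  };
  window.clearImmediate = clearTimeout;
}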
I've started a branch on my fork to do some profiling.
This is the current output:
Node
elements,transformers,time
1000,1, 8.535ms
1000,10, 26.684ms
1000,100, 42.072ms
10000,1, 14.416ms
10000,10, 21.166ms
10000,100, 105.672ms
100000,1, 115.589ms
100000,10, 199.156ms
100000,100, 981.854ms
Browser
elements,transformers,time
1000,1, 60.205810546875ms
1000,10, 280.35693359375ms
1000,100, 2386.37890625ms
10000,1, 365.549072265625ms
10000,10, 2032.72705078125ms
10000,100, 26157.5810546875ms
100000,1, 3524.163818359375ms
100000,10, 27374.552978515625ms
100000,100, 213922.16796875ms
These results show that both the stream size and the number of transformers have a significant impact in browsers.
I'll play around a bit to see what optimizations are possible.
setImmediate(emit, this, 'readable') appeared to be the main problem here.
I did some minor adjustments to squash nested setImmediate calls into one large setImmediate call in rubensworks@841f206.
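I'm not describing the exact commit, but the general idea can be sketched with a hypothetical deferEmit helper: once we are already inside a deferred callback, further emits happen synchronously instead of scheduling yet another setImmediate.
const { EventEmitter } = require('events');

let insideDeferred = false;

function deferEmit(emitter, eventName) {
  if (insideDeferred)
    return void emitter.emit(eventName); // already inside a deferred tick: emit synchronously
  setImmediate(() => {
    insideDeferred = true;
    try { emitter.emit(eventName); }
    finally { insideDeferred = false; }
  });
}

// Example: the nested deferEmit triggered by 'readable' runs in the same
// setImmediate tick instead of scheduling a second one.
const source = new EventEmitter();
source.on('readable', () => deferEmit(source, 'done'));
source.on('done', () => console.log('done'));
deferEmit(source, 'readable');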
@RubenVerborgh Could you have a look at this?
If you think this type of solution is OK, I'll clean things up and test on some larger use cases (because I think this change may break things for some edge cases).
With this change, we get the following performance (almost the same performance in Node and browsers):
Node
elements,transformers,time
1000,1, 6.541ms
1000,10, 32.780ms
1000,100, 30.866ms
10000,1, 8.172ms
10000,10, 18.436ms
10000,100, 144.885ms
100000,1, 24.727ms
100000,10, 140.363ms
100000,100, 1240.560ms
Browser
elements,transformers,time
1000,1, 14.528076171875ms
1000,10, 60.06884765625ms
1000,100, 72.163818359375ms
10000,1, 14.634765625ms
10000,10, 22.64697265625ms
10000,100, 149.65087890625ms
100000,1, 72.8623046875ms
100000,10, 164.00390625ms
100000,100, 1174.532958984375ms
Thanks @rubensworks, will study this in detail.
setImmediate(emit, this, 'readable') appeared to be the main problem here.
Very interesting; do you have numbers to share?
Coincidentally, one of the design considerations in asynciterator was to fix a problem in the old internal iterators within ldf-client, where an explicit state was not kept. As such, in that implementation, the readable event would be fired much more than needed. By keeping an explicit state, we can emit readable only when a relevant state change happened. Now there might have been bugs in that implementation, but it was a consideration.
So there are two questions:
- are all of the readable events correct, or are we emitting more than strictly needed?
- which of the readable events should be emitted async, and which sync? Because we do want to avoid the scenario where a readable listener is attached right after a method call to an AsyncIterator, but never fired because that method emitted readable sync.
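To illustrate the second question, a contrived example with a hypothetical SyncSource (not asynciterator code): if read() were to emit readable synchronously, a listener attached right after the call would simply miss the event.
const { EventEmitter } = require('events');

class SyncSource extends EventEmitter {
  read() {
    // Emitting synchronously: anyone not yet listening misses this event.
    this.emit('readable');
    return null;
  }
}

const source = new SyncSource();
source.read();                                      // fires before the listener exists
source.on('readable', () => console.log('never printed'));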
Very interesting; do you have numbers to share?
I didn't count, but it looked like almost every iterated element resulted in a readable event, which then caused a buffer refill call, all behind setImmediate calls.
My commit does not change the number of events though, so there doesn't seem to be a performance issue with them.
are all of the readable events correct, or are we emitting more than strictly needed?
I haven't looked at it in detail, but I think that multiple readable events may be emitted for the same data element (not 100% certain about this yet though).
In any case, I don't think that the number of events is the problem, but instead the fact that they were emitted behind a setImmediate.
which of the readable events should be emitted async, and which sync?
I also had this concern. I think we may want to have two types of readable events:
- an async one that can be used by external consumers (and optionally internal as well).
- a sync one that can only be used internally for performance reasons (as currently implemented in my commit above).
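A rough sketch of what I mean (with a hypothetical _onReadable hook; not a concrete API proposal): internal consumers get a synchronous callback, while external listeners keep the asynchronous event.
function signalReadable(iterator) {
  // Sync fast path for internal consumers (e.g., a downstream transformer).
  if (typeof iterator._onReadable === 'function')
    iterator._onReadable();
  // Async path for external listeners, preserving the current event contract.
  setImmediate(() => iterator.emit('readable'));
}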
FYI, if you want to run the performance tests in my fork, you can execute npm run perf and npm run perf-browser.
I've added some measurements on BufferedIterator in 93d0e8f.
The performance difference seems to be still apparent here, especially when the number of stream elements is small:
Node
elements:transformers:time
1000:1: 9.666ms
1000:10: 37.189ms
1000:100: 62.169ms
10000:1: 48.658ms
10000:10: 38.543ms
10000:100: 294.332ms
100000:1: 50.555ms
100000:10: 281.684ms
100000:100: 2628.572ms
Browser
elements:transformers:time
1000:1: 127.3662109375ms
1000:10: 268.007080078125ms
1000:100: 496.833984375ms
10000:1: 260.447998046875ms
10000:10: 307.476806640625ms
10000:100: 763.65625ms
100000:1: 313.001220703125ms
100000:10: 543.89404296875ms
100000:100: 2924.14794921875ms
Interesting. Can we also log number of setImmediate calls?
bd5c111 now logs the number of setImmediate calls (only in Node.js).
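The actual instrumentation in that commit may look different, but the idea is essentially a counting wrapper along these lines (hypothetical countedSetImmediate; Node-only because of the exit hook):
const counts = Object.create(null);

function countedSetImmediate(label, fn, ...args) {
  counts[label] = (counts[label] || 0) + 1;     // tally per call site
  return setImmediate(fn, ...args);
}

// Print the tallies when the benchmark process exits.
process.on('exit', () => console.log(counts));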
Transform
elements:transformers:time(ms):setImmediates
1000:1:6:5
1000:10:45:32
1000:100:38:302
10000:1:11:9
10000:10:26:36
10000:100:140:306
100000:1:30:53
100000:10:154:80
100000:100:1430:350
Buffered
elements:transformers:time(ms):setImmediates
1000:1:11:257
1000:10:52:320
1000:100:76:950
10000:1:40:461
10000:10:41:524
10000:100:382:1154
100000:1:53:455
100000:10:346:518
100000:100:2995:1148
To me, it looks like the number of setImmediate calls definitely influences the times in browsers, especially for the BufferedIterator case. I'll have a look at possible performance improvements now...
For reference, the problem with the BufferedIterator seemed to be caused by the number of end events: 539352a.
With this change, the number of setImmediate calls is very low, and the performance difference between Node and browser is minor:
Node
elements:transformers:time(ms):setImmediates
1000:1:8:4
1000:10:35:22
1000:100:34:202
10000:1:40:8
10000:10:94:26
10000:100:269:206
100000:1:57:52
100000:10:303:70
100000:100:2722:250
Browser
elements:transformers:time(ms):setImmediates
1000:1:56:0
1000:10:75:0
1000:100:68:0
10000:1:52:0
10000:10:63:0
10000:100:339:0
100000:1:118:0
100000:10:494:0
100000:100:2602:0
Very interesting. Maybe we should look into whether ending async is desired at all (we can do semver major).
Perhaps async ending can remain, and we can just refactor this part a bit:
AsyncIterator.prototype._end = function (destroy) {
if (this._changeState(destroy ? DESTROYED : ENDED)) {
this._readable = false;
this.removeAllListeners('readable');
this.removeAllListeners('data');
this.removeAllListeners('end');
}
};
function end(self, destroy) { self._end(destroy); }
function endAsync(self) { setImmediate(end, self); }
So that endAsync will only invoke a setImmediate if the iterator has not ended yet.
Because right now, this function can be called many times after each other, which caused the many setImmediate calls.
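Concretely, something along these lines (just a sketch; it assumes the existing ended check is enough, which may not hold):
function endAsync(self) {
  // Only schedule the end once the iterator is not yet ended.
  if (!self.ended)
    setImmediate(end, self);
}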
if the iterator has not ended yet.
If an end has not been scheduled yet?
can be called many times after each other
Does that happen commonly?
if the iterator has not ended yet.
If an end has not been scheduled yet?
Hmm yes, the refactor may not be so trivial. We may need a new 'ENDING' state for this.
can be called many times after each other
Does that happen commonly?
Yes, 252 times for 1000 elements and 1 transformer, to be exact.
Yes, 252 times for 1000 elements and 1 transformer, to be exact.
Very weird; I wonder if that's a bug.
ENDING state is fine with me (semver major).
Ok, this was surprisingly easy, done in 92539e5.
Did not break any tests, but we may want to consider adding additional tests for this later on.
Performance in the browser remains as good as before.
I just had a look at the failing tests in my branch.
Some tests have been fixed. A few others still fail, but I think those should be rewritten based on the internal changes.
@RubenVerborgh I'll hand this over to you now to see whether or not my changes make sense, and to check if the failing tests should actually fail :-)
Ok, this was surprisingly easy, done in 92539e5.
Good codebase.
@RubenVerborgh I'll hand this over to you now to see whether or not my changes make sense, and to check if the failing tests should actually fail :-)
Will do.
I have created a branch feature/performance that contains all measurement scripts. I'll now evaluate each of the proposed solutions one by one to see which one(s) we need.
In db25c5a, I've identified the contribution of different reasons for calling setImmediate. Number 4 will shock you.
- BufferedIterator init 333
- emit readable 83592
- end async 666
- fillBufferAsync 3080916
Does the above contradict your earlier statement?
setImmediate(emit, this, 'readable') appeared to be the main problem here.
Regarding whether double emissions of the end event happen often, I note for the TransformIterator benchmark that 342 iterators are created and 666 asynchronous ends are requested. It's probably not a coincidence that 666 is twice the number of BufferedIterator inits at 333. I wonder if there is an error in the logic which causes them to be ended twice, or whether it is a deliberate safety measure.
However, for the BufferedIterator benchmark, 342 iterators are created in total, all 342 of them are BufferedIterator instances, and there are 342 endings. So no problem there.
All in all, it does not seem like the double ends are a major contributor to the problem (but avoiding them is of course meaningful).
In any case, rubensworks@92539e5 with the addition of ENDING seems like the right decision. So incorporating that for sure.
@rubensworks A logical conclusion from the above is to buffer synchronously instead of asynchronously.
The idea was to make BufferedIterator#read return as soon as possible, and schedule a refill of the buffer after the method had returned. However, it might actually be better to pay the waiting cost of buffering while reading, and thus avoid the scheduling overhead.
I have tried this idea in cdc5f30, and it yields a major performance improvement. Could you try this out in complex scenarios (such as Comunica) and tell me if you have any test failures? It does make some tests fail in AsyncIterator, but that might be due to the tests themselves.
A problem with the fix is that the buffer will now refill every time one element is taken out of it. In ba437cf, I'm triggering a refill only when the buffer is half empty. Better performance, but more test failures (most of which just count the number of reads, which are understandably affected).
Finally, 85122af only fills the buffer when empty.
The last two commits should not cause more failures than the first (let me know if they do). However, performance might be completely different.
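The gist of the three variants, stripped down to a toy class (hypothetical SyncBuffered, not the actual BufferedIterator code): the refill happens synchronously inside read(), and the threshold decides how eagerly.
class SyncBuffered {
  constructor(fill, maxBufferSize = 4) {
    this._fill = fill;                  // fill(buffer, n): push up to n items
    this._buffer = [];
    this._maxBufferSize = maxBufferSize;
  }
  read() {
    // Refill synchronously once the buffer is half empty (the ba437cf variant);
    // cdc5f30 refills on every read, 85122af only when the buffer is empty.
    if (this._buffer.length <= this._maxBufferSize / 2)
      this._fill(this._buffer, this._maxBufferSize - this._buffer.length);
    return this._buffer.length ? this._buffer.shift() : null;
  }
}

// Example: a source producing increasing integers.
let next = 0;
const it = new SyncBuffered((buf, n) => { while (n--) buf.push(next++); });
it.read(); // 0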
Number 4 will shock you.
Great clickbait.
Does the above contradict your earlier statement?
setImmediate(emit, this, 'readable') appeared to be the main problem here.
Yes, that was my initial thought, which appeared to be incomplete.
The last two commits should not cause more failures than the first (let me know if they do). However, performance might be completely different.
What is the performance diff between Node and browser in your branch at the moment?
Are the current changes sufficient to make the overhead in browsers insignificant?
Through this commit, I also noticed a significant performance improvement. Of course, these changes are quite invasive, so we'd probably want to avoid them if not needed.
Could you try this out in complex scenarios (such as Comunica) and tell me if you have any test failures?
Ok, will make time for this later this week.
What is the performance diff between Node and browser in your branch at the moment?
Performance before (NT/NB/BT/BB = Node/Browser × Transform/Buffer; columns: elements:transformers:time(ms):setImmediates)
NT: 100000:100:821:2525502
NB: 100000:100:1404:2550303
BT: 100000:10:13797:0
BB: 100000:10:17141:0
Performance after (same format)
NT: 100000:100:434:502
NB: 100000:100:791:404
BT: 100000:100:565:0
BB: 100000:100:978:0
So not only does it make the browser faster, it makes Node faster too.
But we should ensure these findings have sufficient external validity, i.e. that the improvement is not just on those two benchmarks.
Are the current changes sufficient to make the overhead in browsers insignificant?
Getting close, but there's apparently still some more we want to squeeze out. Not sure if puppeteer causes overhead, might want to check that too before drawing conclusions. Perhaps some other async calls are not needed either.
Through this commit, I also noticed a significant performance improvement. Of course, these changes are quite invasive, so we'd probably want to avoid them if not needed.
Indeed. My idea started by looking at that commit, and thinking if we could make the bookkeeping easier. Then I wondered if we could instead remove the cause of that bookkeeping.
Ok, I did some testing in Comunica.
The good news: all tests still pass, so there seem to be no breakages.
The bad news: performance in browsers has not been improved that much.
I executed the following query in Node and Chrome.
After the change, there is a slight performance improvement for both, but browser is still 10x slower.
Query:
SELECT ?movie ?title ?name
WHERE {
?movie dbpedia-owl:starring [ rdfs:label "Brad Pitt"@en ];
rdfs:label ?title;
dbpedia-owl:director [ rdfs:label ?name ].
FILTER LANGMATCHES(LANG(?title), "EN")
FILTER LANGMATCHES(LANG(?name), "EN")
}
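For reference, a timing harness along these lines could drive such a measurement (a sketch only: the Comunica API calls and the TPF source shown here are assumptions, not the exact setup used):
const { newEngine } = require('@comunica/actor-init-sparql');

async function timeQuery(queryString) {
  const engine = newEngine();
  const start = Date.now();
  const result = await engine.query(queryString, {
    sources: [{ type: 'hypermedia', value: 'http://fragments.dbpedia.org/2016-04/en' }],
  });
  return new Promise((resolve, reject) => {
    result.bindingsStream
      .on('data', () => { /* consume bindings */ })
      .on('error', reject)
      .on('end', () => resolve(Date.now() - start));
  });
}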
Before:
Iterations | Node | Browser |
---|---|---|
1 | 759.711ms | 4754.178ms |
10 | 3437.605ms | 34451.179ms |
After:
Iterations | Node | Browser |
---|---|---|
1 | 739.698ms | 4224.340ms |
10 | 2371.714ms | 31744.079ms |
How shall we go forward with this? Additional profiling in Comunica and perhaps more extensive performance test cases in AsyncIterator?
(FYI, I also tested with my branch, but the tests did break there, so my changes were a bit too extensive...)
How shall we go forward with this?
Could you run logTrackedEvents to understand why? Perhaps there are many more end events here.
Can I easily run this test above, so I can add more logging and insights?
logTrackedEvents with 1 iteration:
- fillBufferAsync: _fillBuffer 3058
- fillBufferAsync: autoStart 931
- setImmediate: BufferedIterator init 2287
- setImmediate: emit data 495
- setImmediate: emit end 3
- setImmediate: emit property 123
- setImmediate: emit readable 10628
- setImmediate: end async 3274
- setImmediate: fillBufferAsync 3989
- setImmediate: property callback 121
- setImmediate: readAndTransform 5142
- setImmediate: readAndTransformSimple 43
logTrackedEvents with 10 iterations:
- fillBufferAsync: _fillBuffer 26198
- fillBufferAsync: autoStart 7123
- setImmediate: BufferedIterator init 19594
- setImmediate: emit data 4950
- setImmediate: emit end 30
- setImmediate: emit property 2317
- setImmediate: emit readable 92968
- setImmediate: end async 28384
- setImmediate: fillBufferAsync 33321
- setImmediate: property callback 120
- setImmediate: property callbacks 1
- setImmediate: readAndTransform 44643
- setImmediate: readAndTransformSimple 430
So the majority appears to be caused by readable events like before, but still quite a lot in the other places as well.
Can I easily run this test above, so I can add more logging and insights?
Nothing as clean as what I've set up for AsyncIterator, unfortunately.
This branch can be used for testing, but it involves some manual steps to get the browser part working (see latest commit in that branch): https://github.com/comunica/comunica/tree/perf/asynciterator
OK, so it seems the test cases are not representative of the scenario we have. (Nonetheless, I think they are good cases and would keep the outcomes.)
Perhaps it would indeed be good to have such cases; I'd also like to understand what is so specific about the Comunica scenario that we hadn't captured. With some tracing, we should be able to understand why so many readable events are being fired. It's likely that many of them don't need to be async.
With some tracing, we should be able to understand why so many readable events are being fired.
👍
Are you able to look into this? If not, I can try to do it later this week, or next week.
Also, Comunica makes use of asynciterator-promiseproxy.js and asynciterator-union.js. Neither should have a big impact on this, but just so we're aware that these are present.
Are you able to look into this? If not, I can try to do it later this week, or next week.
Would be great if you could set up test cases; happy to help look at possible causes and fixes.
Also, Comunica makes use of asynciterator-promiseproxy.js and asynciterator-union.js. Neither should have a big impact on this, but just so we're aware that these are present.
No setImmediate there. Perhaps what we want to do is shim setImmediate to setTimeout and see if performance in Node and browser is the same, just so we're sure there's no other cause.
Another clue:
The query I ran before requests ~600 pages, and creates at least one new iterator for each page.
When I measure all asynciterators that are created, I get the following types (with their number of instantiations):
3 EmptyIterator
78 MultiTransformIterator
122 TransformIterator
244 MediatedLinkedRdfSourcesAsyncRdfIterator
249 RoundRobinUnionIterator
611 SimpleTransformIterator
621 ArrayIterator
819 ClonedIterator
983 PromiseProxyIterator
(total: 3730)
Based on my last comment, my suspicion is that MultiTransformIterator is at the root of this (or at least the creation of many AsyncIterators is). So I set up some additional measurements in rubensworks@37d0cdf.
Node:
elements:transformers:time(ms):setImmediates
1000:1:47:4009
1000:10:87:40072
1000:100:273:400702
10000:1:65:40009
10000:10:280:400072
10000:100:2056:4000702
100000:1:537:400009
100000:10:2447:4000072
100000:100:21768:40000702
Browser:
elements:transformers:time(ms):setImmediates
1000:1:354:0
1000:10:2916:0
1000:100:34154:0
10000:1:2776:0
10000:10:27313:0
(aborted here, too slow...)
tracked events
- fillBufferAsync: _fillBuffer 12321333
- fillBufferAsync: autoStart 333
- setImmediate: BufferedIterator init 333
- setImmediate: emit readable 24642675
- setImmediate: end async 12321675
- setImmediate: fillBufferAsync 12321666
Of course, this is a very extreme case, so we probably should not try to optimize it in full. But it is an indication that we may still need to do something about the 'readable' events.
Perhaps we can try to buffer these events somehow if many of them are encountered in sequence. But doing this without lowering performance may be tricky...
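One possible shape of such buffering (just a sketch with a hypothetical _readablePending flag): only schedule an asynchronous readable emission when none is pending yet, so a long synchronous burst collapses into a single event.
function emitReadable(iterator) {
  if (iterator._readablePending)
    return;                               // an emission is already scheduled
  iterator._readablePending = true;
  setImmediate(() => {
    iterator._readablePending = false;
    iterator.emit('readable');            // one event for the whole burst
  });
}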
For reference, some critical parts in Comunica that heavily make use of AsyncIterator:
@rubensworks So I was able to cut back the number of setImmediate calls drastically by:
1. Not emitting readable on start (01bb2f2)
2. Ending synchronously (c30f1b1)
3. Not insisting on filling the buffer all the way (aa90d22)
Alterations 1 and 2 change the semantics of events. However, they might also be specific to this benchmark, given that they pertain to IntegerIterator.
Could you check if Alteration 3 by itself is sufficient to improve Comunica performance?
Additionally, at some point, Alteration 4 (7b680e4) also made a difference, but this is no longer the case for the given benchmark. I included it just in case it does make a difference for Comunica.
I was finally able to test these changes in Comunica.
Unfortunately, Alteration 4 seems to have broken things, as I get no results from my query anymore.
So I did the experiments using Alterations 1-3.
Query:
SELECT ?movie ?title ?name
WHERE {
?movie dbpedia-owl:starring [ rdfs:label "Brad Pitt"@en ];
rdfs:label ?title;
dbpedia-owl:director [ rdfs:label ?name ].
FILTER LANGMATCHES(LANG(?title), "EN")
FILTER LANGMATCHES(LANG(?name), "EN")
}
Before:
Iterations | Node | Browser |
---|---|---|
1 | 1636.558ms | 4658ms |
10 | 3058.392ms | 17262ms |
After:
Iterations | Node | Browser |
---|---|---|
1 | 1458.051ms | 4159ms |
10 | 3560.065ms | 15558ms |
It looks like the performance improvements are minimal, so we're not there yet.
I also tracked all event types again, but the results here seem similar to before.
Tracked events for 1 iteration:
- fillBufferAsync: autoStart 1175
- setImmediate: BufferedIterator init 2531
- setImmediate: emit data 495
- setImmediate: emit end 3
- setImmediate: emit property 123
- setImmediate: emit readable 9686
- setImmediate: end async 3472
- setImmediate: fillBufferAsync 1175
- setImmediate: property callback 121
- setImmediate: readAndTransform 5142
- setImmediate: readAndTransformSimple 43
Tracked events for 10 iterations:
- fillBufferAsync: autoStart 9563
- setImmediate: BufferedIterator init 22034
- setImmediate: emit data 4950
- setImmediate: emit end 30
- setImmediate: emit property 2319
- setImmediate: emit readable 87281
- setImmediate: end async 30364
- setImmediate: fillBufferAsync 9563
- setImmediate: property callback 121
- setImmediate: readAndTransform 44643
- setImmediate: readAndTransformSimple 430
At the moment I don't have any ideas on how to go forward with this.
Did you use my additional measurements on MultiTransformIterator (rubensworks@37d0cdf) for your latest alterations?
In any case, we probably have to do more profiling and even more extensive test cases.
Did you use my additional measurements on MultiTransformIterator (rubensworks@37d0cdf) for your latest alterations?
I did indeed.
At the moment I don't have any ideas on how to go forward with this.
The problem is that I am optimizing to the benchmark, but not to Comunica.
And there is no real correlation at the moment.
So the main thing to do for me is to test this in Comunica, preferably the smallest relevant build.
The main thing where you can help is by setting up such an environment, and then I can likely take it from there.
As far as the contents go, I think the counters are our main point of feedback.
The highest count is still readable, and Alteration 4 catches a situation in which the readable status would flip-flop twice in succession without being needed. Alteration 4 introduces some subtle timing changes (semver major), so it might lead to breakage, but that should be fixable.
The next concern in numbers is readAndTransform. We might want to investigate there whether the needed transformation is indeed async. If it is not, we might want to build a dedicated synchronous transform.
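For instance, a dedicated synchronous transform could boil down to something like this (hypothetical SyncMapIterator, shown only to make the idea concrete):
class SyncMapIterator {
  constructor(source, map) {
    this._source = source;   // anything with an AsyncIterator-style read()
    this._map = map;         // synchronous item -> item function
  }
  read() {
    const item = this._source.read();
    return item === null ? null : this._map(item);   // no scheduling involved
  }
}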
Next, there's the question of asynchronous endings. Maybe we don't need those.
And finally, asynchronous initializations. Maybe we don't need those either.
On the positive side, fillBufferAsync has gone down significantly, though not with consistent result improvements if I look at Node (but that might be an outlier).
So the main thing to do for me is to test this in Comunica, preferably the smallest relevant build.
The main thing where you can help is by setting up such an environment, and then I can likely take it from there.
Would the scripts I added in comunica/comunica@4ced255 be usable for you?
It should be possible to do something with puppeteer, but due to the long webpacking times of comunica, it may be better if we keep this as a separate manual process.
we might want to build a dedicated synchronous transform.
Many transformations in Comunica are indeed synchronous, so that may definitely be valuable.
And finally, asynchronous initializations. Maybe we don't need those either.
Not sure what you have in mind for this, but at least setting a source asynchronously should be possible, as we use this quite often in PromiseProxyIterator.
I see some opportunities in using the microtask queue rather than the macrotask queue.
Could you try the following?
// Replace macrotask scheduling with the (already resolved) promise's microtask queue.
const resolved = Promise.resolve(null);
function setImmediate(fn, ...rest) {
  resolved.then(() => fn(...rest));
}
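For what it's worth, the standard queueMicrotask API should give the same behaviour where it is available (an assumption that it exists in the Node and browser versions we target):
function setImmediateViaMicrotask(fn, ...rest) {
  queueMicrotask(() => fn(...rest));   // schedules on the microtask queue
}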
I still have some termination problems with tests, but they might just be test problems due to the changed semantics.
It looks like the performance improvements with this Promise trick are the same as with your alterations.
Before:
Iterations | Node | Browser |
---|---|---|
1 | 1636.558ms | 3707ms |
10 | 3058.392ms | 18728ms |
After:
Iterations | Node | Browser |
---|---|---|
1 | ERROR | 3452ms |
10 | ERROR | 15847ms |
The Node ERROR is caused by some setImmediate assumptions in Comunica, which I expect are fixable on my end.
For posterity:
@rubensworks and I have looked more into the issue; it seems that using a microtask-based solution for call-order scheduling (as opposed to setImmediate) yields better performance in Node, and brings browsers down to a reasonable level. Browsers still seem to have a slower task queue implementation than Node.js, but at least the difference is much more manageable.
Bringing down the number of calls does not seem to have as much of an effect anymore with a faster queue; but we might still need to look at that given appropriate real-world cases. (Because we've also learned here that tremendous gains on some benchmarks are meaningless for some actual cases.)