v8/web-tooling-benchmark

Geomean

jfbastien opened this issue · 11 comments

It would be great for geomean to have a confidence interval.

Also, the calculation of geomean seems off?

The value shown is the geometric mean of the average runs/sec of the individual tests. An earlier version was computing the geometric mean of the times of the individual tests, which was obviously off; that is probably what you had seen.
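
To make that calculation concrete, here is a minimal sketch in TypeScript (names and data shapes are hypothetical, not the benchmark's actual code):

```ts
// Geometric mean of the per-test average runs/sec.
function geometricMean(values: number[]): number {
  // Sum of logs avoids overflow/underflow on long products.
  const logSum = values.reduce((sum, v) => sum + Math.log(v), 0);
  return Math.exp(logSum / values.length);
}

// `results` maps a test name to its runs/sec samples (hypothetical shape).
function benchmarkScore(results: Map<string, number[]>): number {
  const averages = [...results.values()].map(
    (samples) => samples.reduce((a, b) => a + b, 0) / samples.length
  );
  return geometricMean(averages);
}
```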

We'll be looking into adding a confidence interval for the geometric mean.

tebbi commented

The confidence intervals we compute for the individual tests are actually the relative standard error of the mean of the individual runtimes. This would make perfect sense if the individual samples were statistically independent. However, since the samples are iterations running in the same isolate, they share feedback, optimized code, and GC, so they are not independent. In particular, the first iterations are always slower than the later ones due to missing feedback and optimized code. So the reported error is systematically too big.
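
For clarity, here is roughly what that relative standard error amounts to, as a sketch under the independence assumption that, as noted, does not hold here:

```ts
// Relative standard error of the mean of runtime samples.
// Only meaningful if the samples are statistically independent,
// which is not the case for iterations sharing one isolate.
function relativeStandardError(samples: number[]): number {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  const standardErrorOfMean = Math.sqrt(variance / n);
  return standardErrorOfMean / mean; // e.g. 0.03 means roughly ±3%
}
```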

tebbi commented

One way to fix this is to separate warm-up and only measure the steady-state of the system. However, this assumes that the steady-state of the VM is always the same, which is probably not the case.

Another way is to re-run the complete benchmark as independently as possible (in a new iframe/realm, or even better: a new process) and only compute confidence intervals based on results of complete runs. This is what the ARES-6 runner does.
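
Roughly what the process-per-run variant could look like on Node (a sketch only; `run-once.js` is a hypothetical entry point that prints a single runs/sec score):

```ts
import { execFileSync } from "child_process";

// Run the whole benchmark `n` times, each in a fresh Node process,
// so the runs share no feedback, optimized code, or GC state.
function independentRuns(n: number): number[] {
  const scores: number[] = [];
  for (let i = 0; i < n; i++) {
    const out = execFileSync(process.execPath, ["run-once.js"], {
      encoding: "utf8",
    });
    scores.push(parseFloat(out.trim()));
  }
  return scores;
}
```

A confidence interval computed over these per-process scores treats each complete run as one independent sample, which is exactly the property the per-iteration numbers lack.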

My team is obviously biased towards liking the ARES-6 approach, in part for the reasons that you state.

I guess we could remove the first N samples and do the average/stddev solely on the remaining samples.
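
Something along these lines (a sketch, with a hypothetical `warmup` cutoff rather than a principled one):

```ts
// Drop the first `warmup` samples, then compute mean and standard
// deviation over the remaining, presumed steady-state samples.
function steadyStateStats(samples: number[], warmup: number) {
  const steady = samples.slice(warmup);
  const n = steady.length;
  const mean = steady.reduce((a, b) => a + b, 0) / n;
  const variance =
    steady.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  return { mean, stddev: Math.sqrt(variance) };
}
```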

iframe/realm is not really an option here, since Node doesn't support realms and this is primarily interesting for Node. There's only the vm module, but that does a lot of magic with the global object, which makes it unsuitable for representative performance measurement.

Couple general comments, but no strong opinions:

  • Agree that testing in the VM module is likely more trouble than it's worth (much work went into eshost to make these realms work-ish, but it's not 100% representative)
  • Disagree with tossing out "outliers" unless it's just a single warmup iteration or something. Removing samples should only be done with very strong motivation and justification around why that datapoint is not statistically relevant.
  • High variance seems acceptable. Addressing high variance with more iterations takes time but likely gives richer data. Unfortunately, variance depends on many factors, and getting acceptably low variance may require wildly different numbers of samples on different implementations/environments. Not sure how to solve this beyond making sure the number of samples is high enough for the worst-case variance observed in practice (see the sketch after this list)?
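
To illustrate the last point, a rough back-of-the-envelope helper (hypothetical, assuming independent samples and a normal approximation):

```ts
// Estimate how many independent samples are needed to reach a target
// relative error at ~95% confidence, given the observed coefficient of
// variation (stddev / mean). Hypothetical helper, not project code.
function requiredSamples(
  coefficientOfVariation: number,
  targetRelativeError: number
): number {
  const z = 1.96; // ~95% confidence under a normal approximation
  return Math.ceil(((z * coefficientOfVariation) / targetRelativeError) ** 2);
}

console.log(requiredSamples(0.1, 0.02)); // noisy environment: ~97 samples
console.log(requiredSamples(0.01, 0.02)); // quiet environment: 1 sample
```

The same target error can require a couple of orders of magnitude more samples in a noisy environment, which is why sizing for the worst case observed in practice gets expensive.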

If anything, I'd think this benchmark would only really care about the first run, because that's what the tools hit, no? We do multi-runs to get statistical significance, so maybe in a Node context you actually want to run the tests from the command line and do the aggregation there, whereas in a web or browser shell context the ARES-6 approach of creating a new iframe/realm is appropriate?

The first run is almost completely irrelevant in practice. The common case is that you have webpack driving the build via loaders (i.e. Babel, TypeScript, PostCSS, Prepack, etc.) and those are run on hundreds to thousands of files. Oftentimes the build takes tens of seconds up to several minutes for a realistic web application. Same for tools like prettier that usually run on several files at once. All of this usually runs in a single Node process, in the same isolate.

@bmeurer ah that's interesting! Shouldn't the benchmark try to capture this type of call graph? I'm guessing you think what you've come up with is highly correlated with that call graph? If so then your setup makes sense.

I've experimented with different setups, including just benchmarking an actual webpack build with all the loaders needed for that project. But this usually didn't work out well, partly because of I/O and the dependency on Node, which made it impossible to run in the browser/shell. The current setup seems to match reality quite well, while mocking out any kind of I/O and creating representative versions for browser/shell.

I'm not saying this is the only possible solution, and I'm certainly open to potential changes here in the future.

tebbi commented

After investigating further, it turned out that the confidence interval numbers we produced were pretty much useless (details and reasoning here). So we disabled the feature for now.