nolanlawson/optimize-js

Benchmark unfairly excludes compilation time.

aickin opened this issue · 4 comments

As I've looked at Chrome timelines and thought about it, I have some concerns about the benchmark. I think it's somewhat biased in favor of optimize-js because optimize-js moves some CPU time from execution to compilation, but the benchmark only measures execution.

Basically, unoptimized code does a quick parse during the compilation phase, and then does a slow/complete parse during the execution phase (along with the actual execution of the code, obviously). After optimize-js has been run, the compilation phase does a slow/complete parse, and the execution phase just runs the code. But since the benchmark measures the time between executing the first line and the last line of the script, it is measuring only the execution phase, which means that the time increase in the compilation phase gets lost. I confirmed this by looking at Chrome timelines: after optimize-js runs, the compilation phase goes up considerably, but the benchmark just reports the execution time.

I think the fairest benchmark is compilation + first execution, as this is what most pages care about for first load. What I don't know is how to measure that. Here are some ideas, all problematic:

  1. Start measurement from the moment that the script element is added to the DOM. This is what the cost-of-small-modules benchmark does, and it clearly shows time moving from the execution phase to the compilation/loading phase when you use optimize-js. The downside, of course, is that it includes loading time as well in the compilation phase. If all the files are served locally, this probably isn't a huge issue, but it is a source of error in the measurements.
  2. Start measurement from the moment that the script element is added to the DOM, but subtract the time from the Resource Timing API. This is the same as 1, except that you would use the Resource Timing API to see how long it took to load the script from the network and subtract that amount from the measurement. This would reduce the network-based error in 1, but it may not work perfectly, because browsers might start the compilation phase before receiving the last byte of the script. If this is the case, then subtracting the load time of the script would hide some of the compilation phase. More conservatively, you could just subtract TTFB from the loading/compilation phase. Also, Resource Timing isn't available on Safari.
  3. Download the script with XHR/fetch, and call eval() on it. The benefit here is that you are definitely capturing compilation + execution without getting any network load time mixed in. The downside is that I could totally believe that browsers disable some perf optimizations for eval, so it's possible the numbers will be misleading.
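To make the arithmetic in idea 2 concrete, here is a rough sketch. The helper name and its parameters are made up for illustration and are not part of optimize-js or its benchmark; it assumes you recorded `performance.now()` when the script element was added and again at the last line of the script, and that a Resource Timing entry for the script is available.

```javascript
// Hypothetical helper, not part of optimize-js or its benchmark.
// insertedAt / finishedAt: performance.now() readings taken when the <script>
// element is added to the DOM and at the last line of the script.
// entry: a PerformanceResourceTiming-like object for the script's URL.
function estimateCompileAndExecute(insertedAt, finishedAt, entry, conservative = true) {
  const total = finishedAt - insertedAt;
  // Conservative mode subtracts only the time to first byte, because the
  // browser may already be parsing while the rest of the script streams in.
  const network = conservative
    ? entry.responseStart - entry.startTime
    : entry.responseEnd - entry.startTime;
  // Clamp at zero in case the timestamps do not line up as expected.
  return Math.max(0, total - network);
}

// Browser-only usage (scriptUrl is an assumed absolute URL of the script):
// const [entry] = performance.getEntriesByName(scriptUrl, "resource");
// const ms = estimateCompileAndExecute(insertedAt, performance.now(), entry);
```

Subtracting the full load time gives a lower bound on compile + execute, and subtracting only TTFB gives an upper bound, so the real figure probably sits somewhere between the two.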

Does this make sense? Do you have other ideas how to measure this (or other thoughts)?

So you've highlighted an important issue here; there's a measurement that the benchmark is not taking into account and this does have the potential to unfairly inflate optimize-js's numbers.

I think the responsible thing would be to add an additional number to measure the compilation time (which indeed should go up). But the question is how best to observe compilation time from JavaScript, which turns out to be a difficult question to answer.

1. Start measurement from the moment that the script element is added to the DOM.

As you say this would get muddied by network overhead, which would make the tests inconsistent and unreliable. You could use a cached version of the script, but the problem is that some engines may pre-JIT a cached script (I'm told V8 does this).

2. Start measurement from the moment that the script element is added to the DOM, but subtract the time from the Resource Timing API.

Besides the Safari problem, there's actually no guarantee that the browser will only begin parsing once it's received the last byte of JavaScript – it can start parsing the moment it starts to receive script.

3. Download the script with XHR/fetch, and call eval() on it.

eval() has side effects and so isn't exactly equivalent to <script> tags. Also <script> tags are more the norm on the web, so it's better to test those.

Due to all these issues, I'd say your first suggestion is probably the best one, but we should do the following:

  1. Download the script as text/plain (to avoid parsing)
  2. Convert it to a blob/dataURL (not sure which one is best, but I imagine blob, since otherwise browsers may cache the dataURL; we will probably want to append random whitespace to the script as well)
  3. Set the blob/dataURL as the source of a non-async non-deferred script and add it to the DOM
  4. Measure the time between adding the script to the DOM and the first line of code in the script as "load/compile time", and the current measured time as "run/execute time"
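The four steps above could look roughly like the following. This is only a sketch under assumptions: the function names and the `bench:*` mark names are invented here, the blob variant is picked over the dataURL one, and the first-line/last-line marks are injected into the source text rather than already being present in the benchmarked script.

```javascript
// Hypothetical helper: wraps the source in first-line/last-line marks and pads
// it with a random amount of trailing whitespace so the text differs per run.
// Caveat: a leading "use strict" directive would no longer be the first statement.
function buildInstrumentedSource(source, padLength = 1 + Math.floor(Math.random() * 1024)) {
  return [
    'performance.mark("bench:first-line");',
    source,
    'performance.mark("bench:last-line");',
    " ".repeat(padLength),
  ].join("\n");
}

// Browser-only sketch: resolves with both durations in milliseconds.
async function measureViaBlob(url) {
  const response = await fetch(url);
  const text = await response.text(); // read as text, so nothing is parsed as JS yet
  const blob = new Blob([buildInstrumentedSource(text)], { type: "text/javascript" });
  const blobUrl = URL.createObjectURL(blob);

  const script = document.createElement("script");
  script.async = false; // script-inserted elements are async by default
  return new Promise((resolve, reject) => {
    script.onload = () => {
      performance.measure("load/compile", "bench:inserted", "bench:first-line");
      performance.measure("run/execute", "bench:first-line", "bench:last-line");
      URL.revokeObjectURL(blobUrl);
      const last = (name) => performance.getEntriesByName(name, "measure").pop().duration;
      resolve({ loadCompile: last("load/compile"), runExecute: last("run/execute") });
    };
    script.onerror = reject;
    script.src = blobUrl;
    performance.mark("bench:inserted"); // taken right before the element is added
    document.body.appendChild(script);
  });
}
```

Whether whitespace-only padding is enough to defeat every engine's code cache is an open question; it would need to be checked in each browser's profiler.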

I think reporting both numbers is important to ensure we don't mislead people on the benefits of optimize-js, so we should definitely do this. 👍

/cc @toddreifsteck who helped walk me through some of the tricky bits of this :)

Oooooh, I love the idea of a blob/dataURL; it hadn't occurred to me, and I agree on all counts about the imperfections of the three ideas I proposed. I didn't even know about using blobs as URLs; that's really neat!

And (perhaps obviously) I am +1 that this is important to do to make sure the numbers are fair. Thanks!

This is my highest-priority task for optimize-js before merging in recent changes. I'm pretty sure the current benchmarks are far too optimistic.

OK, I believe I have found a better system. By creating script elements, setting their textContent to the script source, then randomizing the source by adding a random string, then doing document.body.appendChild(script), we can fully measure compile+execute without measuring the network.

I set a mark right before appendChild(script), and another mark right at the end of the script. I confirmed in the Chrome timeline and Edge (via Windows Performance Analyzer) that the performance measures include parsing and executing. I can also confirm that Chrome is re-parsing every time due to the randomization (which it wouldn't otherwise; it would cache the parsed/JITed code).
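For anyone following along, the approach looks roughly like this. Treat it as a minimal sketch rather than the actual benchmark code: the function names and mark names are placeholders, and the random string is added as a trailing comment, which is just one way of randomizing the source.

```javascript
// Hypothetical helper: appends a random comment so the source text is unique,
// then a mark at the very end of the script.
function randomizeSource(source, token = Math.random().toString(36).slice(2)) {
  return `${source}\n/* ${token} */\nperformance.mark("bench:end");`;
}

// doc and perf default to the browser globals; they are parameters only so the
// function can be exercised with stand-ins outside a browser.
function measureInline(source, doc = document, perf = performance) {
  const script = doc.createElement("script");
  script.textContent = randomizeSource(source);
  perf.mark("bench:start"); // right before appendChild(script)
  doc.body.appendChild(script); // an inline script runs synchronously on insertion
  perf.measure("compile+execute", "bench:start", "bench:end");
  const entries = perf.getEntriesByName("compile+execute", "measure");
  return entries[entries.length - 1].duration;
}

// Browser usage (scriptSource is assumed to be the script text, already in memory):
// const ms = measureInline(scriptSource);
```

Because the script is inline, no network request is involved at all, so the measured span should cover only parsing, compiling, and executing.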

I'll update the benchmark and post new numbers and a new version of optimize-js as soon as I can. Testing so far indicates my numbers were indeed overly optimistic, but optimize-js still comes out on top for the most part. Thank you for raising this issue! 😃