Investigate CPU throttling effects
camillobruni opened this issue · 11 comments
We should measure the impact of various runner configurations on the final score on various platforms:
- Android (low-end and high-end devices)
- macOS (Apple Silicon and Intel)
- Linux (probably mostly Intel desktop)
- Windows (Intel)
These are the axes we want to test:
- measurementMode: rAF vs. timer
- warmup time: 0ms and 50ms
- step delay: 0ms, 50ms, 200ms
TODO:
- Add params for step delay
- Compile complete list of platforms we want to test
- Compile list of test URLs with the parameters we want to check
Once #202 lands we can use it directly; until then, I've hosted my branch:
- Measurement mode: measurementMethod in ['raf', 'timer']
- Delayed steps without warmup: warmupBeforeSync in [0, 25, 50, 100, 200]
- Step delay: waitBeforeSync in [0, 25, 50, 100, 200]
I've used this script to generate all the URLs:
BASE_URL="http://localhost:7000/"
for MODE in timer raf; do
  for WARMUP in 0 50; do
    for DELAY in 0 25 50 100 200; do
      URL="${BASE_URL}?waitBeforeSync=${DELAY}&warmupBeforeSync=${WARMUP}&measurementMethod=${MODE}"
      # Quote the URL so the shell does not treat "?" as a glob character.
      echo "$URL"
    done
  done
done
Laptop: Dell Latitude 5540 (DC) with Intel i5-1350p (4P+8E [16 LC]) (4.7 GHz, 3.5 GHz) CPU
Browsers: Chrome 114.0.5735.134 & Firefox 114.0.1
Commit c46e191e5207229b9f4daf1921cd30ffc20c390c was used. The suite "TodoMVC-React-Complex-DOM" was removed because it was failing intermittently with Safari on M2 macOS.
Detailed data is available here.
Speedometer 3.0 Geomean (timeout based) | | | | |
---|---|---|---|---|
Delay | Chrome + no Warmup | Firefox + no Warmup | Chrome + 50ms Warmup | Firefox + 50ms Warmup |
0.0 ms | 1.00x | 1.00x | 0.88x | 1.12x |
25.0 ms | 0.96x | 1.12x | 0.92x | 1.10x |
50.0 ms | 1.16x | 1.28x | 1.00x | 1.11x |
100.0 ms | 1.34x | 1.30x | 1.13x | 1.11x |
200.0 ms | 1.39x | 1.31x | 1.15x | 1.11x |
0.0 ms | 59.8 ms | 79.3 ms | 52.7 ms | 88.4 ms |
25.0 ms | 57.6 ms | 88.8 ms | 55.1 ms | 87.5 ms |
50.0 ms | 69.5 ms | 101.2 ms | 59.5 ms | 88.3 ms |
100.0 ms | 80.0 ms | 103.1 ms | 67.7 ms | 88.2 ms |
200.0 ms | 83.1 ms | 104.1 ms | 68.6 ms | 88.2 ms |
Speedometer 3.0 Geomean (rAF based) | | | | |
Delay | Chrome + no Warmup | Firefox + no Warmup | Chrome + 50ms Warmup | Firefox + 50ms Warmup |
0.0 ms | 1.02x | 1.13x | 1.11x | 1.14x |
25.0 ms | 1.08x | 1.26x | 1.09x | 1.10x |
50.0 ms | 1.30x | 1.28x | 1.13x | 1.13x |
100.0 ms | 1.38x | 1.33x | 1.14x | 1.21x |
200.0 ms | 1.43x | 1.32x | 1.14x | 1.12x |
0.0 ms | 60.8 ms | 89.6 ms | 66.7 ms | 90.1 ms |
25.0 ms | 64.9 ms | 100.2 ms | 65.5 ms | 87.3 ms |
50.0 ms | 77.7 ms | 101.2 ms | 67.8 ms | 89.9 ms |
100.0 ms | 82.8 ms | 105.6 ms | 68.0 ms | 95.7 ms |
200.0 ms | 85.8 ms | 105.0 ms | 68.0 ms | 89.0 ms |
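The relative-time rows above appear to be each duration divided by the same browser's timeout-based, no-warmup, 0 ms duration (for example 69.5 / 59.8 ≈ 1.16 for Chrome). That baseline is inferred from the numbers, not stated in the thread. A minimal sketch of the normalization under that assumption; the function name `relative_time` is hypothetical:

```shell
#!/usr/bin/env bash
# relative_time BASELINE_MS DURATION_MS
# Prints DURATION_MS / BASELINE_MS rounded to two decimals, e.g. "1.16x".
# Assumption: the baseline is the timeout-based, no-warmup, 0 ms duration.
relative_time() {
  awk -v base="$1" -v dur="$2" 'BEGIN { printf "%.2fx\n", dur / base }'
}

# Chrome, timeout based, no warmup (durations taken from the table above).
CHROME_BASELINE=59.8
for DURATION in 59.8 57.6 69.5 80.0 83.1; do
  relative_time "$CHROME_BASELINE" "$DURATION"
done
```

Run as-is, this reproduces the Chrome "no Warmup" column of the timeout-based table (1.00x, 0.96x, 1.16x, 1.34x, 1.39x). Small mismatches elsewhere are likely due to the table durations already being rounded.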
Hm... looks like there is a 10% discrepancy in Firefox when a delay of 25-50ms is added and there is no warmup.
Also, there is quite a substantial difference between the rAF and timeout numbers on Firefox, while I didn't really see this on macOS.
Three-run data was collected on the Dell Latitude 5540 with Intel i5-1350p (4P+8E [16 LC]) (4.7 GHz, 3.5 GHz) CPU in DC mode to verify the 1.13x seen with Firefox vs. the 1.02x seen with Chrome when comparing the rAF and timer measurement methods. No significant difference between the one-run and three-run data was seen.
Measurements were also collected on an Intel desktop (i9 13900K) with Windows using Chrome and Firefox, for the default case only, to verify the behavior seen by @bas Schouten with an i9 12900HK (see Slack channel).
The following table shows a summary. Detailed data is [here](https://docs.google.com/spreadsheets/d/1k9XK86b8gAStuF3y7lqBK8HvL_5eIrRCPF9397ttGNM/edit?usp=sharing).
Speedometer 3.0 Geomean (timeout based) (Lower duration [in ms] is better) | | | | | |
---|---|---|---|---|---|
Three run data | One run data | Three run data | |||
i9_13900K (Desktop) | i5-1350p (Laptop) DC | i5-1350p (Laptop) DC | i5-1350p (Laptop) DC | ||
Delay | i9_13900K_Chrome + no warm up | i9_13900K_Firefox + no warm up | i5_1350p_FireFox + no warm up | i5_1350p_Chrome + no warm up | i5_1350p_FireFox + no warm up |
0.0 ms | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x |
25.0 ms | | | 1.12x | 0.96x | 1.12x |
50.0 ms | | | 1.26x | 1.16x | 1.28x |
100.0 ms | | | 1.30x | 1.34x | 1.30x |
200.0 ms | | | 1.33x | 1.39x | 1.31x |
0.0 ms | 35.20 | 49.16 | 80.60 | 59.82 | 79.30 |
25.0 ms | | | 89.92 | 57.59 | 88.80 |
50.0 ms | | | 101.56 | 69.51 | 101.20 |
100.0 ms | | | 105.09 | 80.00 | 103.10 |
200.0 ms | | | 107.34 | 83.05 | 104.10 |
Speedometer 3.0 Geomean (rAF based) (Lower duration [in ms] is better) | | | | | |
Delay | i9_13900K_Chrome + no warm up | i9_13900K_Firefox + no warm up | i5_1350p_FireFox + no warm up | i5_1350p_Chrome + no warm up | i5_1350p_FireFox + no warm up |
0.0 ms | 1.05x | 1.02x | 1.10x | 1.02x | 1.13x |
25.0 ms | | | 1.12x | 1.08x | 1.26x |
50.0 ms | | | 1.25x | 1.30x | 1.28x |
100.0 ms | | | 1.33x | 1.38x | 1.33x |
200.0 ms | | | 1.34x | 1.43x | 1.32x |
0.0 ms | 37.07 | 50.17 | 88.45 | 60.80 | 89.60 |
25.0 ms | | | 90.62 | 64.87 | 100.20 |
50.0 ms | | | 101.05 | 77.66 | 101.20 |
100.0 ms | | | 107.42 | 82.81 | 105.60 |
200.0 ms | | | 107.88 | 85.82 | 105.00 |
Some observations:
Firefox behaves differently on desktop and laptop (mobile) systems with Intel CPUs running Windows in the default test case (no warmup, no delay) when comparing the timer and rAF measurement methods.
- On the desktop, for the default case (no delay, no warmup), the rAF vs. timer relative times for Chrome and Firefox are comparable (1.05x and 1.02x respectively).
- On the laptop, rAF vs. timer is 1.02x to 1.07x on Chrome and ~1.10x on Firefox with no delay, with or without warmup.
- On the laptop, 4 of the 17 workloads have rAF vs. timer relative-time ratios above 1.10x. TodoMVC-Svelte is the most impacted.
Workload | no warm up, no delay | 50 ms warm up, no delay |
---|---|---|
TodoMVC-Preact | 1.34x | 1.53x |
TodoMVC-Svelte | 1.54x | 1.95x |
NewsSite-Next | 1.13x | 1.12x |
Charts-observable-plot | 1.44x | 1.43x |
See detailed info in this sheet: workload component steps with ratios above 1.10x are highlighted in green in the tab "RAF comparison". Columns K through R of that tab show a similar comparison with M2 Firefox (from Camillo's data).
Note: The "50ms warmup" data is not shown in the summary (formatting issues in GitHub?); please see the detailed sheet.
Thank you for the data! Now that we've established that a 25ms delay degrades the score when using rAF, it would be useful to observe what happens with 5ms, 10ms, 15ms, and 20ms delays. If we see no difference between 0ms and 15ms, then we're probably good to go, because it means any delay of up to 15ms is of little to no consequence, and the rAF method will introduce at most a 15ms difference (assuming the step itself takes at least 1ms to run).
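The finer-grained delays requested above could be generated the same way as the earlier URL list. This is only a sketch: it assumes the same local server and query parameters as the earlier script, and that the runner accepts arbitrary `waitBeforeSync` values; the function name `generate_urls` is hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: same local server and query parameters as the earlier script.
BASE_URL="http://localhost:7000/"

# generate_urls: print one URL per (mode, delay) pair, always without warmup.
# 0 and 25 are included as reference points for the new 5-20 ms delays.
generate_urls() {
  local mode delay
  for mode in timer raf; do
    for delay in 0 5 10 15 20 25; do
      echo "${BASE_URL}?waitBeforeSync=${delay}&warmupBeforeSync=0&measurementMethod=${mode}"
    done
  done
}

generate_urls
```

This prints 12 URLs (2 measurement modes × 6 delays), one per line.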
I've investigated some of the above situations. Here are a couple of observations:
- Without rAF, delay, or warmup, tests that run fast enough won't get a paint captured in the async time, because no paint will have been scheduled by the time the setTimeout runnable runs.
- With rAF, warmup, or delay, a paint will generally have occurred and a new paint will have been requested. There are then two options: a browser prioritizes painting, and the async time increases; or a browser prioritizes running the timeout, which is bad for the user but would yield a better score.
- Both Chrome and Firefox prioritize the paint. So in the warmup case, any negative performance impact comes from running the paint.
- rAF should guarantee a paint occurs.
- Paint times can differ per browser and per test, and therefore affect tests differently.
- On a fast test like Svelte, the paint time is a much higher relative portion of the test.
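The last point can be made concrete with a little arithmetic: if a paint of roughly fixed cost is captured in the measurement, the relative slowdown is larger for a short step than for a long one. The sketch below uses made-up numbers purely for illustration (they are not measurements from this thread), and the function name `paint_ratio` is hypothetical.

```shell
#!/usr/bin/env bash
# paint_ratio WORK_MS PAINT_MS
# Prints (WORK_MS + PAINT_MS) / WORK_MS, i.e. how much a step's measured
# time grows once a paint of PAINT_MS is captured in the measurement.
paint_ratio() {
  awk -v work="$1" -v paint="$2" 'BEGIN { printf "%.2fx\n", (work + paint) / work }'
}

# Hypothetical values, NOT measurements: the same 2 ms paint on two steps.
paint_ratio 4 2    # fast step: prints 1.50x
paint_ratio 40 2   # slow step: prints 1.05x
```

Under these assumed numbers the same paint cost shows up as a 50% increase on the fast step and only 5% on the slow one, which matches the direction of the per-workload ratios reported above.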
I think everything I've seen here indicates that the rAF variant is better and removes a method of gaming the benchmark that feels undesirable. I also think small differences in the score are well explained by the profile data I'm seeing as being a real effect from performance differences in the paint. From my perspective that makes rAF the better choice.
To be clear, while the information rniwa asked for here would be nice to have, I don't think it would change my decision, for two reasons:
- A 15ms delay -is- likely to affect the scores, because of the aforementioned painting being measured more often.
- That effect is likely larger than any throttling effects so it wouldn't be an effective method of determining the impact of throttling itself.
From a Firefox perspective, my recommendation would be to use rAF with no delay or warmup, based on the data that I've seen.
Now that we've switched to rAF based approach, can we call this done? Is there anything left to do?
I agree. I think we are done here as far as Sp3 is concerned.