Inconsistent `perf:library` results
schneems opened this issue · 2 comments
The goal of `perf:library` was to provide a pretty solid single entry point where benchmarks can be run and a single, trustworthy output generated.
I've made progress on that goal, but we're not quite there yet. For example, I ran the same test 10 different times last night, and here's what I saw (on CodeTriage):
```
$ env SHAS_TO_TEST='6aab468,c321bd2' DERAILED_PATH_TO_LIBRARY='.' bundle exec derailed exec perf:library
```
Executing it 10 times gave me:
```
(NOT Statistically Significant)
[6aab468] "Fstring gem" - (3.225173 seconds)
  SLOWER 🐢🐢🐢 by:
    0.9922x [older/newer]
    -0.7862% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.2000145 seconds)

Iterations per sample: 200
Samples: 200
Test type: Kolmogorov Smirnov
Is significant? (max > critical): false
D critical: 0.12238734153404084
D max: 0.09999999999999998

(NOT Statistically Significant)
[6aab468] "Fstring gem" - (3.210941 seconds)
  FASTER 🚀🚀🚀 by:
    1.0020x [older/newer]
    0.1978% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.2173045 seconds)

Iterations per sample: 200
Samples: 200
Test type: Kolmogorov Smirnov
Is significant? (max > critical): false
D critical: 0.12238734153404084
D max: 0.05999999999999994

(NOT Statistically Significant)
[6aab468] "Fstring gem" - (3.208945 seconds)
  SLOWER 🐢🐢🐢 by:
    0.9993x [older/newer]
    -0.0660% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.206828 seconds)

Iterations per sample: 200
Samples: 200
Test type: Kolmogorov Smirnov
Is significant? (max > critical): false
D critical: 0.12238734153404084
D max: 0.05999999999999994

(NOT Statistically Significant)
[6aab468] "Fstring gem" - (3.2028555 seconds)
  FASTER 🚀🚀🚀 by:
    1.0012x [older/newer]
    0.1245% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.2068465 seconds)

Iterations per sample: 200
Samples: 200
Test type: Kolmogorov Smirnov
Is significant? (max > critical): false
D critical: 0.12238734153404084
D max: 0.07500000000000001

❤️ ❤️ ❤️ (Statistically Significant) ❤️ ❤️ ❤️
[6aab468] "Fstring gem" - (3.186942 seconds)
  FASTER 🚀🚀🚀 by:
    1.0083x [older/newer]
    0.8270% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.213518 seconds)

Iterations per sample: 200
Samples: 105
Test type: Kolmogorov Smirnov
Is significant? (max > critical): true
D critical: 0.16891057858422326
D max: 0.2380952380952381

(NOT Statistically Significant)
[6aab468] "Fstring gem" - (3.200376 seconds)
  FASTER 🚀🚀🚀 by:
    1.0054x [older/newer]
    0.5373% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.217666 seconds)

Iterations per sample: 200
Samples: 200
Test type: Kolmogorov Smirnov
Is significant? (max > critical): false
D critical: 0.12238734153404084
D max: 0.10000000000000003

(NOT Statistically Significant)
[6aab468] "Fstring gem" - (3.204838 seconds)
  FASTER 🚀🚀🚀 by:
    1.0013x [older/newer]
    0.1256% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.2088685 seconds)

Iterations per sample: 200
Samples: 200
Test type: Kolmogorov Smirnov
Is significant? (max > critical): false
D critical: 0.12238734153404084
D max: 0.095

(NOT Statistically Significant)
[6aab468] "Fstring gem" - (3.208226 seconds)
  SLOWER 🐢🐢🐢 by:
    0.9973x [older/newer]
    -0.2692% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.199613 seconds)

Iterations per sample: 200
Samples: 200
Test type: Kolmogorov Smirnov
Is significant? (max > critical): false
D critical: 0.12238734153404084
D max: 0.07

❤️ ❤️ ❤️ (Statistically Significant) ❤️ ❤️ ❤️
[6aab468] "Fstring gem" - (3.165487 seconds)
  FASTER 🚀🚀🚀 by:
    1.0200x [older/newer]
    1.9636% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.228888 seconds)

Iterations per sample: 200
Samples: 61
Test type: Kolmogorov Smirnov
Is significant? (max > critical): true
D critical: 0.22160858543889803
D max: 0.26229508196721313

❤️ ❤️ ❤️ (Statistically Significant) ❤️ ❤️ ❤️
[6aab468] "Fstring gem" - (3.186347 seconds)
  FASTER 🚀🚀🚀 by:
    1.0100x [older/newer]
    0.9945% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.2183525 seconds)

Iterations per sample: 200
Samples: 176
Test type: Kolmogorov Smirnov
Is significant? (max > critical): true
D critical: 0.1304653444395813
D max: 0.17613636363636365
```
Results of 10 runs:
- 7 not significant
  - Faster: 4
  - Slower: 3
- 3 significant
  - Faster: 3
  - Slower: none
So these results aren't too bad. They're consistent in the sense that the runs that say "statistically significant" all point in the same direction (the code got faster).
Based on this set you could say "if it's not significant, then discard the result" and you would end up with 3 tests showing the same thing, which is good.
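For anyone trying to read those numbers: "D max" is the largest vertical gap between the two samples' empirical CDFs, and "D critical" is the threshold it has to clear, which shrinks as the sample count grows. Here's a minimal Ruby sketch of a two-sample KS check (not derailed's actual code); the printed "D critical" values appear consistent with an alpha of 0.10, e.g. sqrt(-log(0.05) / 2) * sqrt((200 + 200) / (200 * 200)) ≈ 0.1224.

```ruby
# Minimal two-sample Kolmogorov-Smirnov sketch (not the actual perf:library code).
# Returns D max (largest gap between the two empirical CDFs), D critical
# (rejection threshold for the given alpha), and whether the gap clears it.
def ks_two_sample(old_times, new_times, alpha: 0.10)
  n = old_times.length.to_f
  m = new_times.length.to_f

  # D max: at every observed value, compare the two empirical CDFs
  pooled = (old_times + new_times).sort
  d_max = pooled.map do |t|
    cdf_old = old_times.count { |x| x <= t } / n
    cdf_new = new_times.count { |x| x <= t } / m
    (cdf_old - cdf_new).abs
  end.max

  # Large-sample critical value: c(alpha) * sqrt((n + m) / (n * m)).
  # c(0.10) = sqrt(-log(0.05) / 2) ~= 1.2239, which reproduces the
  # "D critical: 0.1223..." lines above for n = m = 200.
  c_alpha = Math.sqrt(-Math.log(alpha / 2.0) / 2.0)
  d_critical = c_alpha * Math.sqrt((n + m) / (n * m))

  { d_max: d_max, d_critical: d_critical, significant: d_max > d_critical }
end
```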
However, here's an older case I had with the exact same commits that was flagged statistically significant, yet its result did not agree with the other 3 significant tests.
```
❤️ ❤️ ❤️ (Statistically Significant) ❤️ ❤️ ❤️
[6aab468] "Fstring gem" - (4.445098688172043 seconds)
  SLOWER by:
    0.7744x [older/newer]
    -29.1292% [(older - newer) / older * 100]
[c321bd2] "Merge pull request #1188 from codetriage/schneems/upsert-doc-methods" - (3.442366107526882 seconds)

Iterations per sample: 200
Samples: 93
Test type: Kolmogorov Smirnov
Is significant? (max > critical): true
D critical: 0.17947750551100136
D max: 0.30107526881720426
```
Yikes. In the prior 10 runs even the statistically significant results say it's only around a 1% difference, while this one claims things got roughly 30% slower.
This was before I switched from average to median. That switch makes the results less dramatic: 30% slower becomes 9.4% slower. But the problem persists; this is still very far from the "consensus" picture that running the test 10 times shows.
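As a toy illustration of why the median helps (made-up numbers, not the benchmark data): a few large outliers drag the mean far more than the median.

```ruby
# Made-up timings with a heavy right tail (not the real benchmark data)
times  = [3.2, 3.2, 3.3, 3.3, 3.4, 3.4, 3.5, 8.1, 10.3, 12.2]
mean   = times.sum / times.length.to_f
median = times.sort[times.length / 2] # even length: upper-middle value, close enough here

puts "mean: #{mean.round(2)}, median: #{median}"
# => mean: 5.39, median: 3.4 -- the tail moves the mean, barely touches the median
```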
I think it registers as statistically significant because these really are two different curves; however, the outliers in the "new" data set are massive. Here is a comparison of the tail values:
| Old | New |
| --- | --- |
| 3.925983 | 6.560837 |
| 4.049389 | 6.708022 |
| 4.065886 | 6.761255 |
| 4.173183 | 6.842483 |
| 4.21175 | 8.125128 |
| 4.212002 | 8.129534 |
| 4.243142 | 8.302587 |
| 4.367608 | 8.7203 |
| 4.769074 | 10.282378 |
| 4.792769 | 10.284609 |
| 4.936911 | 12.238133 |
These are sorted values, not the actual paired comparison runs. The raw data is here: https://www.dropbox.com/s/fwhw3cxir18zz2h/problem-data.zip?dl=0
While we could run this perf:library test multiple times to see if the results agree, I want to be able to run the task only once and get a fairly confident 👍 or 👎. My question is essentially: could we massage all of this data in a way that discredits the -29.1292% result, or at least flags it as suspicious?
One thing that stood out to me here is that "old" has a variance of 0.16 while "new" has a variance of 3.5. While it's okay to say that newer is slower, we cannot conclusively say that it is 9.4% slower (which is a lot).
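One hypothetical way to surface that automatically (a sketch only; the helper name and the 4x cutoff are things I'm making up, nothing like this exists today): compare the two samples' variances and flag the comparison when one dwarfs the other.

```ruby
def variance(xs)
  mean = xs.sum / xs.length.to_f
  xs.sum { |x| (x - mean)**2 } / (xs.length - 1).to_f
end

# Hypothetical heuristic: when one sample's variance dwarfs the other's,
# don't trust the magnitude of the reported difference. The 4x cutoff is
# arbitrary; the "bad" run above had 0.16 vs 3.5, a ~22x ratio.
def variance_ratio_suspicious?(old_times, new_times, max_ratio: 4.0)
  variances = [variance(old_times), variance(new_times)]
  variances.max / variances.min > max_ratio
end
```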
Here's the "good" result set to compare to: https://www.dropbox.com/s/awlb395xnsl8f6g/derailed-11-runs.zip?dl=0 (with the "bad" data also annotated).
Here's a spreadsheet showing the distribution of the "bad" data: https://docs.google.com/spreadsheets/d/19W5PP656bns9K7Oo4bPcu-F7K4IVn0mS306sih9IBMk/edit?usp=sharing
I talked with @noahgibbs, and I think the next step is to remove the "stop on statistical significance" feature from this code path.
Here's what can happen: the data is really noisy and bounces into significance, so we stop comparing, but if I had let it keep running it might have bounced back out.
Rather than making the developer decide ahead of time how many samples they want before the comparison process begins, we should run the test for its full duration by default and output the statistical significance every X iterations, for example every 50. If a developer is trying to iterate and needs faster results, they can treat those intermediate outputs as a weak indicator that their efforts are working. If they want to know the "real" result, they let the whole thing run.
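Roughly, the loop could look like this (a hypothetical shape only; `take_sample` and `report_significance` are stand-ins, not derailed's actual internals):

```ruby
# Hypothetical sketch: run the full sample budget and report interim
# significance periodically instead of stopping at the first significant read.
REPORT_EVERY = 50

def run_full_comparison(total_samples:)
  old_times = []
  new_times = []

  total_samples.times do |i|
    old_times << take_sample(:older) # stand-in for the real sampling step
    new_times << take_sample(:newer)

    if (i + 1) % REPORT_EVERY == 0
      # Interim reads are a weak signal only; noise can bounce in and
      # out of significance, which is exactly why we no longer stop here.
      report_significance(old_times, new_times, interim: true)
    end
  end

  # The "real" answer: significance computed over the whole run
  report_significance(old_times, new_times, interim: false)
end
```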
A "best" solution would be to also run the same A/B test against the same test twice to really make sure that it doesn't accidentally give us statistically significant results as well. Right now we have a codepath to guard against this change, but we could consider adding an env var to disable it when desired.
This was "solved" by adding histograms.