LilithHafner/Chairmarks.jl

benchmark results oscillate

matthias314 opened this issue · 11 comments

In this example, the timings returned by @b oscillate between two values, one of which agrees with @btime:

julia> using StaticArrays, BenchmarkTools, Chairmarks

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @b $v == $v
  9.522 ns (0 allocations: 0 bytes)
16.110 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @b $v == $v
  9.510 ns (0 allocations: 0 bytes)
9.763 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @b $v == $v
  9.515 ns (0 allocations: 0 bytes)
16.536 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @b $v == $v
  9.517 ns (0 allocations: 0 bytes)
9.764 ns

I only get this on my laptop, not on another machine. versioninfo() for my laptop:

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores

Packages:

Status `/tmp/jl_AUb4oc/Project.toml`
  [6e4b80f9] BenchmarkTools v1.5.0
  [0ca39b1e] Chairmarks v1.1.2
  [90137ffa] StaticArrays v1.9.3

Addendum: For @btime I once got the higher number when I ran the benchmark for the first time:

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @b $v == $v
  16.658 ns (0 allocations: 0 bytes)
17.701 ns

but I cannot reproduce this.

Very interesting. Thank you for reporting this even though it is hard to reproduce.

My initial guess is that this is due to a random slowdown causing the autotuner to pick a significantly too low number of evaluations, which results in a longer reported runtime. Both Chairmarks and BenchmarkTools are susceptible to this, but Chairmarks more so because it spends less time tuning.
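
To illustrate the mechanism (an illustration of the hypothesis, not a claim about what happened on this machine): per-sample overhead only becomes negligible when it is amortized over many evaluations, so forcing a low evals should inflate the reported per-evaluation time:

using StaticArrays, Chairmarks
v = zero(SVector{14,Int16})
display(@be $v == $v evals=1)        # one evaluation per sample: per-sample overhead dominates
display(@be $v == $v evals=10_000)   # overhead amortized over many evaluations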

One possible solution is to verify that the tuning was reasonable using the final estimated runtime, and retry if not.
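
A minimal sketch of that idea (the helper name and threshold are assumptions for illustration, not Chairmarks' actual tuning code):

using StaticArrays, Chairmarks

# target_sample_seconds is an assumed per-sample tuning target, not a Chairmarks constant
function be_with_retry(v; target_sample_seconds=30e-6)
    res = @be $v == $v
    est = minimum(s -> s.time, res.samples)   # final per-evaluation runtime estimate
    evals = res.samples[1].evals              # evaluations per sample chosen by the tuner
    # samples running far shorter than targeted suggest tuning hit a transient
    # slowdown and picked too few evaluations, so rerun once
    if evals * est < target_sample_seconds / 4
        res = @be $v == $v
    end
    return res
end

be_with_retry(zero(SVector{14,Int16}))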


However, after a fair bit of experimentation, the closest I could get to reproducing this is:

julia> v = zero(SVector{1,Int16}); @btime $v == $v; @be $v == $v seconds=.003
  1.916 ns (0 allocations: 0 bytes)
Benchmark: 56 samples with 8260 evaluations
min    1.993 ns
median 2.003 ns
mean   2.007 ns
max    2.406 ns

julia> v = zero(SVector{1,Int16}); @btime $v == $v; @be $v == $v seconds=.003
  2.041 ns (0 allocations: 0 bytes)
Benchmark: 4 samples with 7769 evaluations
       2.183 ns
       2.183 ns
       2.188 ns
       2.188 ns

julia> v = zero(SVector{1,Int16}); @btime $v == $v; @be $v == $v seconds=.003
  1.958 ns (0 allocations: 0 bytes)
Benchmark: 54 samples with 8256 evaluations
min    1.993 ns
median 1.999 ns
mean   2.006 ns
max    2.387 ns

julia> v = zero(SVector{1,Int16}); @btime $v == $v; @be $v == $v seconds=.003
  1.958 ns (0 allocations: 0 bytes)
Benchmark: 33 samples with 7773 evaluations
min    2.182 ns
median 2.187 ns
mean   2.197 ns
max    2.546 ns

This is totally inconsistent with that hypothesis. What do you get when you change @b to @be?

This seems suspicious and related:

julia> for _ in 1:10
           v = zero(SVector{1,Int16});
           x = @be $v == $v seconds=.003
           println(sum(s -> s.time*s.evals, x.samples) / .003) # efficiency: fraction of the 3 ms budget spent in measured evaluations
       end
0.005333333333333334
0.6315463333333339
0.9134653333333339
0.8869929999999999
0.8970749999999997
0.8211156666666665
0.8458399999999997
0.9160763333333332
0.7899360000000001
0.959712333333333

Perhaps it is because compile time is counted against the runtime budget, resulting in significantly fewer samples being taken.
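
One way to test that (a sketch; the wall-vs-measured comparison is ad hoc): if compilation is charged against the budget, the first pass should still take about 3 ms of wall time while measuring almost nothing, whereas if it is not, the first pass's wall time itself should blow up.

using StaticArrays, Chairmarks
v = zero(SVector{1,Int16})
for pass in 1:3
    wall = @elapsed (x = @be $v == $v seconds=.003)
    measured = sum(s -> s.time*s.evals, x.samples)   # time spent inside evaluations
    println("pass $pass: wall = $wall s, measured = $measured s")
end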

What do you get when you change @b to @be?

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v
  17.638 ns (0 allocations: 0 bytes)
Benchmark: 2699 samples with 1725 evaluations
min    15.839 ns
median 16.171 ns
mean   16.095 ns
max    27.762 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v
  9.525 ns (0 allocations: 0 bytes)
Benchmark: 3421 samples with 2721 evaluations
min    9.763 ns
median 9.771 ns
mean   9.879 ns
max    22.681 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v
  9.514 ns (0 allocations: 0 bytes)
Benchmark: 3201 samples with 1564 evaluations
min    16.551 ns
median 18.191 ns
mean   18.113 ns
max    32.007 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v
  9.521 ns (0 allocations: 0 bytes)
Benchmark: 3410 samples with 2709 evaluations
min    9.766 ns
median 9.778 ns
mean   9.930 ns
max    23.148 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v
  9.515 ns (0 allocations: 0 bytes)
Benchmark: 3267 samples with 1561 evaluations
min    17.669 ns
median 17.701 ns
mean   17.861 ns
max    29.844 ns

Note that for @btime I again get the larger value once at the beginning.

Very strange. This is inconsistent with both the compile-time idea and the tuning idea. To confirm that it is not tuning-related, what if you add explicit keywords so the tuner never runs:

v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421

what if you add explicit keywords so the tuner never runs:

v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421

Now I almost always get the larger value for @be -- and an oscillating behaviour for @btime!

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421
  9.518 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    9.766 ns
median 10.036 ns
mean   10.170 ns
max    43.306 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421
  10.576 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    15.828 ns
median 16.102 ns
mean   16.322 ns
max    45.574 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421
  17.644 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    15.829 ns
median 16.258 ns
mean   16.347 ns
max    49.786 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421
  10.573 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    15.829 ns
median 16.262 ns
mean   16.319 ns
max    49.806 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421
  17.570 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    15.828 ns
median 16.266 ns
mean   16.718 ns
max    43.635 ns

julia> v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421
  10.572 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    15.829 ns
median 16.019 ns
mean   16.339 ns
max    29.027 ns

wtf. Benchmarking is hard. I have no idea.

What if you put it in a function?

f() = begin v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421 end

and then call f() repeatedly.

What if you put it in a function?

julia> f() = begin v = zero(SVector{14,Int16}); @btime $v == $v; @be $v == $v evals=2721 samples=3421 end
f (generic function with 1 method)

julia> f()
  9.516 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    17.675 ns
median 17.692 ns
mean   18.746 ns
max    53.658 ns

julia> f()
  9.778 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    17.666 ns
median 17.688 ns
mean   18.915 ns
max    52.516 ns

julia> f()
  10.578 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    17.663 ns
median 17.688 ns
mean   18.602 ns
max    46.182 ns

julia> f()
  9.515 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    17.674 ns
median 17.689 ns
mean   18.355 ns
max    46.739 ns

julia> f()
  16.661 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    17.674 ns
median 17.688 ns
mean   18.145 ns
max    41.606 ns

julia> f()
  9.516 ns (0 allocations: 0 bytes)
Benchmark: 3420 samples with 2721 evaluations
min    17.674 ns
median 17.691 ns
mean   18.206 ns
max    38.266 ns

Maybe this cannot be debugged ...

I wonder if @jrevels has any ideas, seeing how this affects both BenchmarkTools and Chairmarks.

If you take out the call to BenchmarkTools, are the Chairmark results still slow?

g() = begin v = zero(SVector{14,Int16}); @be $v == $v evals=2721 samples=3421 end

And then call g() repeatedly.

If you take out the call to BenchmarkTools, are the Chairmark results still slow?

Yes, still slow with g(), both within the same Julia session as before and in a fresh one.

I wish I could reproduce this on my machine; that would make debugging easier. I've tried unsuccessfully on:

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, generic)
  Threads: 1 on 8 virtual cores

julia> versioninfo()
Julia Version 1.11.0-alpha1
Commit 671de9f5793 (2024-03-01 08:30 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_EDITOR = code

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, nehalem)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
Environment:
  JULIA_EDITOR = code

Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 4 × Apple Silicon
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, k8-sse3)
  Threads: 1 on 4 virtual cores