bencheeorg/benchee

macOS reports it has nanosecond precision, but it most often doesn't --> bad nanosecond benchmark results

PragTob opened this issue · 5 comments

Finally got around to verifying this once I got my hands on a Mac. I'll add more details from my own computer later, but the gist is this:

Problem

Under circumstances I don't understand, :erlang.monotonic_time() only answers with microsecond precision while acting as if it were nanosecond precision. The impact is that our measurements aren't precise enough, and as a result we report weird measurements/medians of 0 ns or the like.

Here you can see the monotonic_time() behaviour changing between sessions: answering in multiples of a thousand (microsecond precision) versus precise values (nanosecond precision):

iex(20)> :erlang.monotonic_time()
-576460662355025000
iex(21)> :erlang.monotonic_time()
-576460661060082000
iex(22)> :erlang.monotonic_time()
-576460660057118000
iex(23)> :erlang.monotonic_time()
-576460658439012000
iex(24)> :erlang.monotonic_time()
-576460657486274000
iex(25)> :erlang.monotonic_time()
-576460656358455000
iex(26)>
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
^CJalynas-MacBook-Pro-2:strings_bench tobi$ iex
Erlang/OTP 22 [erts-10.5.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Interactive Elixir (1.9.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :erlang.monotonic_time()
-576460750092774257
iex(2)> :erlang.monotonic_time()
-576460748033749614
iex(3)> :erlang.monotonic_time()
-576460746882032373
iex(4)> :erlang.monotonic_time()
-576460745953850985
iex(5)> :erlang.monotonic_time()
-576460744563625901
iex(6)>
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
^CJalynas-MacBook-Pro-2:strings_bench tobi$ iex -S mix
Erlang/OTP 22 [erts-10.5.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Interactive Elixir (1.9.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :erlang.monotonic_time()
-576460749655663000
iex(2)> :erlang.monotonic_time()
-576460748341135000
iex(3)> :erlang.monotonic_time()
-576460746837166000
iex(4)>
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
^CJalynas-MacBook-Pro-2:strings_bench tobi$ iex
Erlang/OTP 22 [erts-10.5.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

Interactive Elixir (1.9.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :erlang.monotonic_time()
-576460748654849000
iex(2)> :erlang.monotonic_time()
-576460747093918000

And yes, the native time unit claims to be :nanosecond (no change when converting native to nanosecond):


iex(7)> :erlang.convert_time_unit(1000, :native, :nanosecond)
1000
iex(8)> :erlang.monotonic_time()
-576460691963568000
iex(9)> :erlang.monotonic_time()
-576460691066966000
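
To double check, a quick sketch like the following samples the clock a few times and tests whether every value is a multiple of 1000, i.e. effectively microsecond granularity (this is just an illustration, it's not part of the benchmark):

# sketch: if all raw samples land on multiples of 1000, the clock is
# effectively only microsecond-precise despite claiming nanoseconds
samples = Enum.map(1..10, fn _ -> :erlang.monotonic_time() end)
Enum.all?(samples, fn t -> rem(t, 1000) == 0 end)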

As a result, our measurements can look like this and produce very odd results:

0
2000
1000
0
1000
1000
1000
1000
1000
0
1000
0
0
1000
0
1000
0
1000
1000
1000
0
1000
0
1000
1000
0
1000
1000

Name             ips        average  deviation         median         99th %
iodata        1.56 M      642.21 ns   ±330.72%        1000 ns        3000 ns
string        1.39 M      720.55 ns  ±1535.92%           0 ns        3000 ns

Comparison:
iodata        1.56 M
string        1.39 M - 1.12x slower +78.34 ns

Background

See this Twitter thread; basically, I noticed odd-looking benchmarking results in this blog post (a median of 0 ns). The benchmark can be found here.

Further Debugging

After looking into the Erlang Time Correction docs, I hoped that the os_monotonic_time_source information might be different, but apart from the bad time values it seems to be the same between the Mac and my Linux machine:

# example from my Linux system; apart from the time value, the Mac results look the same
iex(2)> :erlang.system_info(:os_monotonic_time_source)
[
  function: :clock_gettime,
  clock_id: :CLOCK_MONOTONIC, 
  resolution: 1000000000,
  extended: :no,
  parallel: :yes,
  time: 5688955479888
]

I also investigated time_warp_mode, but the results on the Mac didn't change.
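
(For reference, trying a different time warp mode just means starting the VM with erl's +C switch, roughly like this:)

$ iex --erl "+C multi_time_warp"
iex(1)> :erlang.system_info(:time_warp_mode)
:multi_time_warp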


Opened an issue at Erlang/OTP: https://bugs.erlang.org/browse/ERL-1067

Possible Solutions

  • explicitly don't trust macOS and treat it as we'd treat microsecond measurements (repetition and all)
  • make the repeated measurement code more intelligent and "doubtful": when a measurement comes back as a multiple of a thousand, retry and see whether it ends up divisible by a thousand again (see the sketch after this list)
  • your idea here
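
Very roughly, the "doubtful" check from the second bullet could look like this (just a sketch; suspiciously_coarse?/1 is a made-up name, not existing benchee code):

# sketch only: if a whole batch of measurements comes back as multiples of
# 1000, the clock is probably only microsecond-precise and we should fall
# back to repeated measurement instead of trusting single readings
defmodule PrecisionCheck do
  def suspiciously_coarse?(measurements) do
    Enum.all?(measurements, fn time -> rem(time, 1000) == 0 end)
  end
end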

Workarounds possible now

  • run benchmarks on Linux
  • manually integrate repetition so each measured block goes over a threshold of ~10 microseconds (see the sketch below)
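
The repetition workaround boils down to measuring a batch of calls so the total time clears the granularity, then dividing afterwards; something along these lines, where bench_fun stands for whatever function you're benchmarking and the repetition count is illustrative:

# sketch: run the function 100 times per measured iteration so the total
# duration is well above the ~1 µs granularity; reported times are then
# per 100 calls and need to be divided by 100 for per-call numbers
repetitions = 100

Benchee.run(%{
  "iodata repeated #{repetitions}x" => fn ->
    Enum.each(1..repetitions, fn _ -> bench_fun.() end)
  end
})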

Environment

Values obtained from an MBP from ~2014/2015 running macOS 10.14.2 Mojave on an i5 processor.

Is this a benchee bug?

Yes and no. Yes, the results reported at the end aren't accurate; however, I'd say that the clock reporting a different precision than it advertises is a bug in the BEAM, or maybe macOS, or I dunno what.

We still need to fix it/make it better, as lots of people out there run their benchmarks on macOS.

Can I help?

If you have a Mac, you can run the benchmark and post your results: https://github.com/oestrich/strings_bench

If you could open an iex shell, run :erlang.monotonic_time() a couple of times, kill the session and run it again to see if the precision changes as shown above, that'd be nice. I'd like to see whether this happens all the time/for everything or just for specific configurations.
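
If it helps, pasting something like this into each fresh session captures both the reported clock source and a few raw samples in one go (just a convenience snippet):

IO.inspect(:erlang.system_info(:os_monotonic_time_source))
Enum.each(1..5, fn _ -> IO.inspect(:erlang.monotonic_time()) end)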

Confirmed it's a bug in Erlang, at least with regard to the reported resolution.

Alrighty. After all of this time I'm here and actually looking at it. Also, sorry for the delay, but well... life and stuff. Fuck 2020.

Anyhow.

Investigating this is worse than I anticipated. I initially managed to reproduce it, but now I seemingly can't... I only switched a couple of things around (a different formatter), but now... it seems to be working fine? Not sure if it's the mood of the OS or whatever.

I changed the bottom of the original script like this:

bench_fun = fn -> View.render("iodata-long", %{character: character}) end

Benchee.run(
  %{
    "iodata" => bench_fun
  },
  formatters: [
    {Benchee.Formatters.Console, extended_statistics: true}
  ]
)

# grab the raw native-time measurements benchee itself would take
alias Benchee.Benchmark.Collect.NativeTime

fun = fn _i -> {time, _} = NativeTime.collect(bench_fun); time end

# two batches of 1000 measurements, printing the last 50 of each
1..1000
|> Enum.map(fun)
|> Enum.take(-50)
|> IO.inspect()

1..1000
|> Enum.map(fun)
|> Enum.take(-50)
|> IO.inspect()

And the output I get looks okay:

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
Number of Available Cores: 12
Available memory: 32 GB
Elixir 1.10.3
Erlang 22.3

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 7 s

Benchmarking iodata...

Name             ips        average  deviation         median         99th %
iodata        1.34 M      746.69 ns  ±3536.53%         494 ns        1973 ns

Extended statistics:

Name           minimum        maximum    sample size                     mode
iodata          419 ns    13294633 ns         1.13 M                   467 ns
[1969, 1998, 1995, 2024, 3585, 1982, 2034, 2081, 1968, 3950, 1996, 2100, 3756,
 2047, 1990, 1990, 2105, 2088, 1961, 2021, 3814, 2015, 2052, 1976, 1962, 2006,
 1980, 3812, 1985, 1975, 2024, 1961, 1989, 1963, 1958, 3849, 1977, 1993, 2019,
 1973, 1975, 2034, 1956, 3849, 1975, 2003, 1999, 1988, 1986, 1999]
[3568, 1981, 2003, 1982, 1992, 1976, 2041, 2013, 3829, 3557, 2303, 1973, 1996,
 1981, 2004, 1997, 3775, 2017, 1997, 1988, 2001, 1979, 2045, 4152, 1990, 1989,
 2007, 1980, 1965, 1982, 2020, 4032, 1983, 1957, 1982, 1965, 1996, 1973, 2010,
 3780, 1986, 1991, 1993, 1980, 1984, 1982, 1997, 3800, 1966, 1978]

Although earlier I got this:

[screenshot of the earlier results]

Not sure what gives.

Which now... makes me think that maybe, after all, it's the function call overhead measurements that are messing with things.
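
If that's the case, timing an essentially empty function with the same collector should show how big the bare call/measurement overhead is; a quick sketch along the lines of the script above:

# sketch: measure a no-op with the same collector as above to see the bare
# function call / measurement overhead on this machine
alias Benchee.Benchmark.Collect.NativeTime

noop = fn -> nil end

1..1000
|> Enum.map(fn _ -> {time, _} = NativeTime.collect(noop); time end)
|> Enum.take(-50)
|> IO.inspect()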

mezz commented

Hello, I'm just running into this issue today on a 2018 15" MacBook Pro.

I am running Benchee 1.0.1 and

Elixir 1.11.4
Erlang 23.3.1

I am getting graphs like the ones posted here, with many 0 ns or 1000 ns results.

It looks like the Erlang issue is resolved in my version of Erlang.

Monotonic time results are consistent for me across 5 iex sessions in a row; it looks like I have a lower resolution than the OP:

iex(8)> :erlang.system_info(:os_monotonic_time_source)
[
  function: :clock_gettime,
  clock_id: :CLOCK_MONOTONIC,
  resolution: 1000000,
  extended: :no,
  parallel: :yes,
  time: 494623835664000
]
iex(9)> :erlang.monotonic_time()
-576460731659128000
iex(10)> :erlang.monotonic_time()
-576460731045460000
iex(11)> :erlang.monotonic_time()
-576460730387654000
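
For reference, the resolution values translate into tick sizes like this:

# resolution is reported in ticks per second
div(1_000_000_000, 1_000_000_000) # OP's machine:  1 ns per tick
div(1_000_000_000, 1_000_000)     # this machine:  1000 ns (1 µs) per tick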

Here are my results for strings_bench using your edits to the script:

Operating System: macOS
CPU Information: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
Number of Available Cores: 12
Available memory: 32 GB
Elixir 1.11.4
Erlang 23.3.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 7 s

Benchmarking iodata...

Name             ips        average  deviation         median         99th %
iodata        2.00 M      500.44 ns  ±5517.56%           0 ns        1000 ns

Extended statistics:

Name           minimum        maximum    sample size                     mode
iodata            0 ns    33122000 ns         4.48 M                     0 ns
[0, 0, 3000, 1000, 0, 0, 1000, 1000, 0, 5000, 0, 0, 1000, 1000, 0, 0, 1000,
 2000, 1000, 1000, 0, 1000, 1000, 0, 0, 2000, 0, 1000, 1000, 0, 1000, 1000, 0,
 2000, 1000, 0, 0, 1000, 1000, 0, 2000, 1000, 0, 0, 1000, 0, 0, 1000, 2000, 0]
[1000, 2000, 1000, 1000, 0, 0, 1000, 1000, 0, 2000, 0, 1000, 1000, 0, 0, 1000,
 2000, 0, 1000, 1000, 0, 0, 1000, 1000, 2000, 1000, 0, 0, 1000, 1000, 0, 0,
 3000, 0, 0, 1000, 1000, 0, 0, 1000, 2000, 1000, 1000, 0, 1000, 1000, 0, 2000,
 0, 0]

Please let me know if I can assist with this further; I'm glad to help out.

This was FINALLY solved in #339