Configure benchmark machine for maximal stability

Question

Configure benchmark machine for maximal stability

lrytz opened this issue 8 years ago · 16 comments

lrytz commented 8 years ago

Disable hyper-threading

In BIOS
Some sources mention the noht kernel parameter, but others say it doesn't work.
echo 0 > /sys/devices/system/cpu/cpuN/online for all N that don't have their own core id in cat /proc/cpuinfo
https://serverfault.com/questions/235825/disable-hyperthreading-from-within-linux-no-access-to-bios

NUMA

The machine only has a single NUMA node, so we don't need to worry about it.

http://stackoverflow.com/questions/11126093/how-do-i-know-if-my-server-has-numa

scala@scalabench:~$ sudo dmesg | grep -i numa
[    0.000000] No NUMA configuration found

scala@scalabench:~$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3

Use cpu sets

Install cset: sudo apt-get install cpuset. (On NUMA machines, cset also handles sets of memory nodes, but we only have one.)

cset set to create, manipulate CPU sets
cset proc to mange processes into sets
cset shield is convenience, simpler to use, allows isolating a process

Shielding

cset shield shows the current status
cset shield -c 1-3
- creates 3 sets: "root" with all CPUs, "user" with CPUs 1-3 (the "shield"), and "system" with the other CPUs.
- userspace processes in root are moved to system
cset shield -k on moves kernel threads (those that can be moved) from root to system (some kernel threads are specific to a CPU and not moved)
cset shield -v -s / -u show shielded / unshielded processes
cset shield -e cmd -- -cmdArg execute cmd -cmdArg in the shield
cset shield -r reset the shield

References

Use isolated CPUs

NOTE: Using isolated CPUs for running the JVM is not a good idea. The kernel doesn't do any load balancing across isolated CPUs. https://groups.google.com/forum/#!topic/mechanical-sympathy/Tkcd2I6kG-s, https://www.novell.com/support/kb/doc.php?id=7009596. Use cset instead of isolcpus and taskset.

lscpu --all --extended lists CPUs, also logical cores (if hyper-threading is enabled). The CORE column shows the physical core.

Kernel parameter isolcpus=2,3 removes CPUs 2 and 3 from the kernel's scheduler.

In /etc/default/grub, for example GRUB_CMDLINE_LINUX_DEFAULT="quiet isolcpus=2,3"
sudo update-grub

Verify

cat /proc/cmdline
cat /sys/devices/system/cpu/isolated
taskset -cp 1-- affinity list of process 1
ps -eww --forest -o pid,ppid,psr,user,stime,args -- there should be nothing on isolated cores.

Use taskset -c 2,3 <cmd> to run cmd (and child processes) only on CPUs 2 and 3.

Questions

Running on fewer cores probably impacts performance as the JVM runs compilation and GC concurrently.
When using taskset -c 2,3, does the JVM still think the system has 4 cores? Would that be a problem?

$ taskset -c 0,1 ~/scala/scala-2.11.8/bin/scala -e 'println(Runtime.getRuntime().availableProcessors())'
2
$ taskset -c 1 ~/scala/scala-2.11.8/bin/scala -e 'println(Runtime.getRuntime().availableProcessors())'
2

References

Tickless / NOHZ

Disable scheduling clock interrupts on the CPUs used for benchmarking, add the nohz_full=2,3 kernel parameter if there's a single task (thread) on the CPU.

Verify

cat /sys/devices/system/cpu/nohz_full
dmesg|grep dyntick should show the CPUs
sudo perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 stress -t 1 -c 1 should show 1 tick (see redhat reference)
- On my test system (after building a kernel with CONFIG_NO_HZ_FULL), i got numbers between 20 and 90 ticks on the otherwise idle CPU 1. Running on CPU 0, I get ~390 ticks.
- watch -n 1 -d grep LOC /proc/interrupts shows 1 tick per second on CPU 1 when idle
- Running anything stress -t 1 -c 1 on CPU 1 causes more ticks
- Running the scala REPL on CPU 1 causes more ticks whenever the RPEL is not idle

NOTE: disabling interrupts has some effect on CPU frequency, see https://fosdem.org/2017/schedule/event/python_stable_benchmark/ (24:45). Make sure to use a fixed CPU frequency. I don't have the full picture yet, but its something like that: the intel_pstate driver is no longer notified and does not update the CPU frequency.

Therefore: disable intel_pstate when using tickless mode
https://bugzilla.redhat.com/show_bug.cgi?id=1378529

(Some more advanced stuff in http://www.breakage.org/2013/11, pin some regular tasks to specific CPUs, writeback/cpumask, writeback/numa).

References

rcu_nocbs

RCU is a thread synchronization mechanism. RCU callbacks may prevent a cpu from entering adaptive-tick mode (tickless with 0/1 tasks). https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt

The rcu_nocbs=2,3 kernel param prevents CPUs 2 and 3 from queuing RCU callbacks.

References

https://en.wikipedia.org/wiki/Read-copy-update
Mentioned on http://stackoverflow.com/questions/20133523/how-do-i-get-tickless-kernel-to-work-nohz-full-rcu-nocbs-isolcpus-what-else
https://fosdem.org/2017/schedule/event/python_stable_benchmark/ (6:30): "I don't know the details, but the idea is that it will not spawn kernel code on this CPU"
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt

Interrupt handlers

Avoid running interrupt handlers on certain CPUs

/proc/irq/default_smp_affinity is the default bit mask of CPUs permitted for an interrupt handle
/proc/irq/N/ contains smp_affinity (bit mask of allowed CPUs) and smp_affinity_list (list of CPUs able to execute the interrupt handler)

Verify

cat /proc/interrupts

There's an irqbalance service (systemctl status irqbalance)

https://www.novell.com/support/kb/doc.php?id=7007602: make sure to disable irqbalance when pinning irq handlers to certain processors
https://serverfault.com/questions/513807/is-there-still-a-use-for-irqbalance-on-modern-hardware: "You should use irqbalance unless You are manually pinning your apps/IRQ's to specific cores for a very good reason"

References

CPU Frequency

Disable Turbo Boost

In BIOS
Or write 1 to /sys/devices/system/cpu/intel_pstate/no_turbo -- if using pstate
- with intel_pstate=disable, find out how to disable turbo boost it in the system

There seem to be two linux tools

cpufrequtils, with cpufreq-info and cpufreq-set (https://wiki.debian.org/HowTo/CpuFrequencyScaling), used by krun
cpupower (https://wiki.archlinux.org/index.php/CPU_frequency_scaling) - for debian jessie that only exists in backports
It seems cpupower is actively developed and has more features, support for newer cpus (https://bbs.archlinux.org/viewtopic.php?id=135820)

Intel can run in different P-States, voltage-frequency pairs when running a process. C-States are idle / power saving states. The intel_pstate driver handles this.

The intel_pstate=disable kernel argument disables the intel_pstate driver and uses acpi-cpufreq instead (see redhad reference).

sudo apt-get install linux-cpupower (in jessie backports only!)
cpupower frequency-info and cpupower idle-info to show the active drivers.

CPU Info

lscpu
cat /proc/cpuinfo (| grep MHz)
cpupower frequency-info
watch -n 1 grep \"cpu MHz\" /proc/cpuinfo

CPUfreq Governors

List available governor: cpupower frequency-info --governors (Examples: performance, powersave, ...). Should use performance, which keeps the maximal frequency. NOTE: the intel_pstate driver still does dynamic scaling in this mode.
Check active governors: cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Set governor: cpupower -c 1-3 frequency-set --governor [governor] (on CPUs 2, 3)

Set a specific frequency:

sudo cpupower -c 1-3 frequency-set -f 2400MHz. Use -u for max, -d for min.
- This activates the userspace cpu governor
Valid frequencies: cpupower frequency-info
Which frequency? Dmitry suggested to under-clock.
Does not work with the intel_pstate driver (http://stackoverflow.com/questions/23526671/how-to-solve-the-cpufreqset-errors)

The intel_pstate driver has /sys/devices/system/cpu/intel_pstate/min_perf_pct and max_perf_pct, maybe these can be used if we stick with that driver?

References

Disable git gc

https://stackoverflow.com/questions/28092485/how-to-prevent-garbage-collection-in-git

$ git config --global gc.auto 0

Disable hpet

Suggested by Dmitry, I haven't found any other references.

hpet is a hardware timer with a frequency of at least 10 MHz (higher than older timer circuits).

Current source: cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Available sources: cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Change using a kernel parameter clocksource=acpi_pm

Explanation of clock sources: https://access.redhat.com/solutions/18627

References

Ramdisk

tmpfs vs ramfs

ramfs: older, cannot set size limit
tmpfs: can set size limit, may swap if system runs out of memory
https://www.jamescoyle.net/knowledge/951-the-difference-between-a-tmpfs-and-ramfs-ram-disk

Added to /etc/fstab

tmpfs /mnt/ramdisk tmpfs defaults,size=16g 0 0

Disable "transparent hugepages"

There are some recommendations out there to disable "transparent hugepages", mostly for database servers

Disable `khungtaskd`

Probably not useful, runs every 120 seconds only. Detects hung tasks.

Cron jobs

https://help.ubuntu.com/community/CronHowto

User crontabs: crontab -e to edit, crontab -l to show
Show all user crontabs: for user in $(cut -f1 -d: /etc/passwd); do sudo crontab -u $user -l; done. Or make sure that the /var/spool/cron/crontabs directory is empty.
System crontab: /etc/crontab - should not edit by hand
/etc/cron.d contains files with system crontab entries
/etc/cron.hourly / .daily / .monthly / .weekly contain scripts executed from /etc/crontab (or by anacron, if installed)

Disable / enable cron

systemctl stop cron
systemctl start cron

Disable / enable at

systemctl stop atd
systemctl start atd

Run under perf stat

Suggestion by Dmitry, discard benchmarks with too many cpu-migrations, context-switches. Would need to keep track of expected values.

sudo perf stat -x, scalac Test.scala (machine-readable output)
-prof perfnorm in jmh

References

Build custom kernel

Ah well, probably have to figure out some more details how to do this correctly.

apt-get install linux-source-4.9
tar xaf /usr/src/linux-source-4.9.tar.xz

apt-get install build-essential fakeroot libncurses5-dev

cd linux-source-4.9
cp /boot/config-4.9.0-0.bpo.2-amd64 .config
make menuconfig
  - General setup->Timers subsystem->Timer tick handling -> Full dynticks system (tickless)
  - Up one level -> Full dynticks system on all CPUs by default (except CPU 0)
  - General setup->Local Version, enter a simple string
nano .config
  - comment out CONFIG_SYSTEM_TRUSTED_KEYS
    https://unix.stackexchange.com/questions/293642/attempting-to-compile-any-kernel-yields-a-certification-error

make deb-pkg

cd ..
sudo dpkg -i linux-image-4.9.18_4.9.18-1_amd64.deb

Scripting all of that

It seems that python3's "perf" package will do most configurations:

https://perf.readthedocs.io/en/latest/system.html#system

pip3 install perf
python3 -m perf system show
python3 -m perf system tune
python3 -m perf system reset

Important: check all settings before starting a benchmark.

Check load

Find a way to ensure that the benchmark machine is idle before starting a job.

Machine Specs

NX236-S2HD (http://www.nixsys.com/nx236-s2hd.html)

Motherboard: X11SSZ https://www.supermicro.com/manuals/motherboard/C236/MNL-1744.pdf
Intel C236 Chipset
Intel Core i7-6700 (4 Core, 8 Thread)
64GB (4x 16GB) DDR4 PC17000 (2133MHz)
WD Black 2TB (WD2003FZEX)
Samsung 850 Pro 512GB

Answer 1 · 2017-03-20T12:57:19.000Z

I seem to remember someone (@adriaanm?) suggesting our script could trigger a reboot and then run the actual benchmark during the shutdown or startup sequence, at a point when superfluous services aren't running and when other users can't log in.

We could still use the Jenkins SSH Slave functionality to set all this up, but we'd have to add a custom build step to poll for completion.

Answer 2 · 2017-03-20T13:23:27.000Z

I could imagine that during startup / shutdown or right after startup the system might schedule maintenance tasks and not be the most stable either.

We should definitely check if there's a difference if we don't use a jenkins slave / ssh connection.

Answer 3 · 2017-03-20T15:12:13.000Z

Several more suggestions based on my experience:

use performance cpufreq governor;
but under-clock the cpu. This will ensure that it never throttles and has reliable performance;
disable hpet;
replace cron with anacron and disable it before test execution.

Answer 4 · 2017-03-20T15:15:06.000Z

Since I've switched to ssd, they can have periodic maintenance that may slow down stuff.
Because of this I now use ram-disk for entire OS during benchmarking. I don't think you need to go so extreme as I did, but moving the working directory & ivy cache into ramdisk may be a good idea.

Answer 5 · 2017-03-20T15:17:49.000Z

one more idea, that I came up with but didn't have time to try out:
always run the entire vm under perf stat java .... and disqualify the tests if there has been to many cpu-migrations\context-switches.

Answer 6 · 2017-03-21T03:09:51.000Z

I've added a script (~/bin/setup-benchmark.sh) that is run before the benchmarks (with sudo) that:

disables hyperthreading (taking cores 12-23 offline)
disables turbo boost
enables the "performance" cpu frequency scaling governor with min/max frequency of 2000MHz

The last part appears to be ignored, though, running:

% watch grep \"cpu MHz\" /proc/cpuinfo

Shows the frequencies scaling back and forth between 1200 and 2400.

I'm still seeing larger-than-expected variance in the runs.

Given:
https://serverfault.com/questions/716317/linux-why-does-the-cpu-frequency-fluctuate-when-using-the-performance-governor
https://wiki.archlinux.org/index.php/CPU_frequency_scaling

Another step might be to disable the pstate driver, but this gets a little beyond my comfort zone on a box that I don't have a keyboard and monitor attached too...

Answer 7 · 2017-03-21T03:13:55.000Z

This appears to be a pretty comprehensive guide to setting up stable benchmark environments:

https://perf.readthedocs.io/en/latest/system.html#system
https://haypo.github.io/journey-to-stable-benchmark-system.html

Answer 8 · 2017-03-21T05:26:12.000Z

Also interesting, Virtual Machine Warmup Blows Hot and Cold

In order to control as many
of these as possible, we wrote Krun, a new benchmark runner.
Krun itself is a ‘supervisor’ which, given a configuration
file specifying VMs, benchmarks (etc.) configures a Linux
or OpenBSD system, runs benchmarks, and collects the results.

Krun uses cpufreq-set to set the CPU governor to
performance mode (i.e. the highest non-overclocked frequency
possible). To prevent the kernel overriding this setting,
Krun verifies that the user has disabled Intel P-state
support in the kernel by passing intel pstate=disable
as a kernel argument

https://github.com/softdevteam/krun
http://soft-dev.org/talks/2017/warmup_smal.pdf slides from a presentation

Answer 9 · 2017-03-21T05:29:34.000Z

Therefore, before each process execution (including
before the first), Krun reboots the system, ensuring
that the benchmark runs with the machine in a (largely)
known state. After each reboot, Krun is executed by the init
subsystem; Krun then pauses for 3 minutes to allow the system
to fully initialise; calls sync (to flush any remaining
files to disk) followed by a 30 second wait; before finally
running the next process execution.

Answer 10 · 2017-05-15T11:53:44.000Z

I did a few experiments with isolcpus and taskset. I ran hot -p source=scalap -wi 20 -i 10 -f 1 across various configurations.

Without isolcpus:

no taskset: 1249.039 ± 5.852 ms/op
taskset -c 1-3: 1235.922 ± 4.144 ms/op
taskset -c 2,3: 1258.894 ± 7.385 ms/op
taskset -c 1 (similar for -c 2): 1625.414 ± 63.652 ms/op

One possible explanation could be that GC causes jitter when there's only one processor available, as it cannot run in parallel.

With isolcpus=1-3

no taskset: 1583.593 ± 51.703 ms/op
- This is maybe similar to the 1 CPU version above
taskset -c 1-3: 1347.184 ± 43.096 ms/op
- Here I don't know why the variance is large

With isolcpus=2,3

no taskset: 1272.709 ± 5.543 ms/op
taskset -c 0,1: 1261.332 ± 5.239 ms/op
taskset -c 1,2: 1414.475 ± 48.747 ms/op
taskset -c 2,3: 1372.691 ± 48.531 ms/op

The large variances when using taskset on the isolated CPUs are surprising.

Answer 11 · 2017-05-15T14:03:33.000Z

I added -prof perfnorm to the jmh command for the isolcpus=2,3 case.

taskset -c 0,1: cpu-migrations 3.325 #/op, page-faults 1696.013 #/op
taskset -c 2,3: cpu-migrations doesn't appear in the log, page-faults 3077.282 #/op

Answer 12 · 2017-05-15T19:48:20.000Z

It makes sense now: when using taskset to move a process on an isolated cpu, the kernel doesn't do any load balancing across CPUs. https://groups.google.com/forum/#!topic/mechanical-sympathy/Tkcd2I6kG-s, https://www.novell.com/support/kb/doc.php?id=7009596. started reading about cpuset, will experiment.

Answer 13 · 2017-05-18T12:57:59.000Z

Added a script that checks the machine state and sets some of the configurations discussed in the main description of this issue (https://github.com/scala/compiler-benchmark/blob/master/scripts/benv)

I ran some experiments in various configurations

$ sbt 'export compilation/jmh:fullClasspath' | tail -1 | tee compilation/cp'
$ cd compilation
$ java -cp $(cat cp) org.openjdk.jmh.Main HotScalacBenchmark -p source=scalap

I didn't do multiple runs to see the how much the error values vary. The error numbers are probably too close together / jittery to make a meaningful comparison, but I'm trying anyway.

Config	Result	Error/Score*1000
clean	1242.208 ± 5.331	4.29
clean, through sbt (`sbt 'hot -p source=scalap'`)	1256.471 ± 4.734	3.77
some services stopped (atd, acpid, dbus, irqbalance, rsyslogd)	1235.294 ± 3.799	3.08
CPU frequency fixed to 3400 MHz	1259.373 ± 5.872	4.66
CPU frequency fixed to 2000 MHz	2089.546 ± 9.279	4.44
CPU shield (1-3) (*)	1274.204 ± 5.806	4.56
interrupt affinities set to 1	1242.420 ± 4.473	3.60

(*) sudo cset shield sudo -- -u scala java -cp $(cat cp) org.openjdk.jmh.Main HotScalacBenchmark -p source=scalap

Running through sbt doesn't seem to be a problem, but maybe it's still better to run directly
Stopping services and setting interrupt affinities might help
Fixing the CPU frequency at 3400 (max) might increase variance (?)
Fixing the CPU frequency at 2000 leads to much fewer data points in the given time, so there's no win. We'd have to run for longer, but then we can also run longer on the higher frequency.
Shielding doesn't seem to help. Here it doesn't seem to hurt either. In general, while working on this, I never saw a good improvement with shielding. In my observations it rather seemed to increase variance.

In combination

Enabling all configuration (services, 3400 MHz, shield, interrupts): 1273.223 ± 5.779
Enabling all but the shield: 1270.649 ± 5.161
Only servces and interrupts: 1238.750 ± 4.609

Again, the error numbers are not stable enough to make a useful conclusion.

Answer 14 · 2017-05-18T13:21:16.000Z

For comparison I ran a simple benchmark that creates a new Global (https://github.com/scala/compiler-benchmark/compare/master...lrytz:newGlobal?expand=1).

sbt 'compilation/jmh:run NewGlobalBenchmark -wi 5 -i 10 -f 3'

clean state: 189.892 ± 0.253
with all configs in the script, and running in a shield: 188.785 ± 0.227

One thing that jumps out is that variances are much more stable between iterations than what we're seeing when running the entire compiler. In the compiler we always see things like

Iteration   1: 1253.048 ±(99.9%) 7.425 ms/op
Iteration   2: 1243.611 ±(99.9%) 38.322 ms/op
Iteration   3: 1232.193 ±(99.9%) 26.320 ms/op
...

For NewGlobalBenchmark,

[info] Iteration   1: 187.737 ±(99.9%) 1.405 us/op
[info] Iteration   2: 187.800 ±(99.9%) 1.408 us/op
[info] Iteration   3: 187.975 ±(99.9%) 1.648 us/op
[info] Iteration   4: 187.794 ±(99.9%) 1.381 us/op
...

Maybe the IO has an impact here. I'll experiment a bit with -Ystop-after and with using a ramdisk.

Answer 15 · 2017-05-18T13:24:46.000Z

Actually, of course the number of benchmarks invocations is much higher for NewGlobalBenchmark (I got 789350) compared to HotScalacBenchmark (260).

Answer 16 · 2017-05-19T14:58:42.000Z

Using a ramdisk (for the compiler-benchmark checkout, the benchmarked compiler's output directory, and the ivy cache containing all jars, including the compiler), and with the benchmark config (stop services, 3400 MHz, interrupt affinity, but without the CPU shield): 1223.810 ± 5.396 ms/op. This is a bit faster than what I saw on the SSD (1270.649 ± 5.161), but the variance is the same.

I also ran with -Ystop-before:jvm

ramdisk: 1032.061 ± 2.627
ssd: 1051.952 ± 2.942

This suggests that IO could be a cause of variance, but the ramdisk doesn't help to reduce it.

Disable hyper-threading

NUMA

Use cpu sets

Use isolated CPUs

Tickless / NOHZ

rcu_nocbs

Interrupt handlers

CPU Frequency

Disable git gc

Disable hpet

Ramdisk

Disable "transparent hugepages"

Disable khungtaskd

Cron jobs

Run under perf stat

Build custom kernel

Scripting all of that

Check load

Machine Specs

Disable `khungtaskd`