roofbench

Benchmark utility for CPU FLOPS, core latency, and memory bandwidth.

Building

Dependencies:

- Linux
- GCC 11.0 or Clang 11.0 or newer (Clang is preferred)
  - libomp-dev: LLVM OpenMP Runtime Library (if using Clang)
  - libc++-dev: LLVM C++ standard library (if using -stdlib=libc++)
  - lld: LLVM linker (if using -fuse-ld=lld)
- Meson build system
- numactl or libnuma-dev

# Build with GCC
export AR=gcc-ar CC=gcc CXX=g++ RANLIB=gcc-ranlib
meson setup builddir -D simd_batch_size_f32=232 -D simd_batch_size_f64=116
ninja -C builddir

# Build with Clang
export AR=llvm-ar CC=clang CXX=clang++ RANLIB=llvm-ranlib
export CXXFLAGS=-stdlib=libc++ LDFLAGS='-fuse-ld=lld -stdlib=libc++'  # Optional
meson setup builddir -D simd_batch_size_f32=240 -D simd_batch_size_f64=120
ninja -C builddir

The build system uses -march=native by default, so the binary is optimized for the CPU of the build machine and may not run on CPUs that lack the same instruction set extensions.

Intel AVX-512

Turning on 512-bit SIMD can increase peak FLOPS on Intel CPUs. However, in a multitasking environment, the performance of other processes will be reduced.

# Build with GCC, preferring 512-bit vectors
export AR=gcc-ar CC=gcc CXX=g++ RANLIB=gcc-ranlib
meson setup builddir -D cpp_args=-mprefer-vector-width=512 -D simd_batch_size_f32=464 -D simd_batch_size_f64=232 --wipe
ninja -C builddir

# Build with Clang, preferring 512-bit vectors
export AR=llvm-ar CC=clang CXX=clang++ RANLIB=llvm-ranlib
export CXXFLAGS=-stdlib=libc++ LDFLAGS='-fuse-ld=lld -stdlib=libc++'  # Optional
meson setup builddir -D cpp_args="$CXXFLAGS -mprefer-vector-width=512" -D simd_batch_size_f32=480 -D simd_batch_size_f64=240 --wipe
ninja -C builddir
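
The 512-bit build only makes sense if the CPU exposes AVX-512 at all. As a minimal sketch (not part of roofbench; the function names are hypothetical), the following parses /proc/cpuinfo-style text and checks for avx512f, the flag Linux reports for AVX-512 Foundation:

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels label the line "flags"; AArch64 kernels use "Features".
        if sep and key.strip() in ("flags", "Features"):
            flags.update(value.split())
    return flags


def has_avx512f(cpuinfo_text: str) -> bool:
    """Return True if the AVX-512 Foundation flag is present."""
    return "avx512f" in cpu_flags(cpuinfo_text)
```

On a Linux machine, passing the contents of /proc/cpuinfo to has_avx512f gives a quick answer before reconfiguring the build with --wipe.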

Optimal SIMD batch size

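The optimal value is: (total SIMD register count − occupied register count) × (SIMD register width in bytes) ÷ sizeof(element), where the element is float for simd_batch_size_f32 and double for simd_batch_size_f64.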

Each cell lists simd_batch_size_f32, simd_batch_size_f64.

Compiler	AArch64 NEON (128-bit)	AVX2 (256-bit)	AVX-512 (512-bit)
GCC	120, 60	232, 116	464, 232
Clang	120, 60	240, 120	480, 240
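
The formula can be checked against the batch sizes used in the build commands. The sketch below is illustrative only: the function name is hypothetical, and the occupied register counts (2 for Clang, 3 for GCC) together with a total of 32 registers are assumptions inferred from those values, not figures stated by the project.

```python
def simd_batch_size(total_registers: int, occupied_registers: int,
                    register_bits: int, element_bytes: int) -> int:
    """Free registers times the number of elements that fit in one register."""
    register_bytes = register_bits // 8
    elements_per_register = register_bytes // element_bytes
    return (total_registers - occupied_registers) * elements_per_register


# Assumed: 32 SIMD registers, 2 occupied under Clang, 3 under GCC.
clang_avx512 = (simd_batch_size(32, 2, 512, 4), simd_batch_size(32, 2, 512, 8))
gcc_avx512 = (simd_batch_size(32, 3, 512, 4), simd_batch_size(32, 3, 512, 8))
clang_256 = (simd_batch_size(32, 2, 256, 4), simd_batch_size(32, 2, 256, 8))
gcc_256 = (simd_batch_size(32, 3, 256, 4), simd_batch_size(32, 3, 256, 8))
neon = (simd_batch_size(32, 2, 128, 4), simd_batch_size(32, 2, 128, 8))

print(clang_avx512, gcc_avx512, clang_256, gcc_256, neon)
```

Under these assumptions the results match the build commands: 480, 240 and 464, 232 for the 512-bit builds, 240, 120 and 232, 116 for the default builds. The 256-bit figures imply 32 available registers, which holds on AVX-512-capable CPUs; a CPU with only AVX2 has 16 vector registers, so its optimal values would likely be smaller.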

Running

OMP_PLACES=threads OMP_PROC_BIND=true ./builddir/roofbench | tee results.json
./plot_latency.py results.json > latency.svg

The output is in JSON format.
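
Because the output is JSON, it can be inspected with any JSON library. The sketch below assumes results.json holds a single JSON document and deliberately makes no assumption about key names; load_results and outline are hypothetical helpers, not part of roofbench.

```python
import json


def load_results(path="results.json"):
    """Parse the file produced by `./builddir/roofbench | tee results.json`."""
    with open(path) as f:
        return json.load(f)


def outline(node, depth=0, max_depth=2):
    """Yield indented lines describing the nesting of parsed JSON data."""
    indent = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            yield f"{indent}{key}: {type(value).__name__}"
            if depth < max_depth:
                yield from outline(value, depth + 1, max_depth)
    elif isinstance(node, list) and node:
        # Only the first element is described, assuming lists are homogeneous.
        yield f"{indent}[{len(node)} items, first is {type(node[0]).__name__}]"
        if depth < max_depth:
            yield from outline(node[0], depth + 1, max_depth)
```

Printing "\n".join(outline(load_results())) shows which sections the file contains before writing any further processing.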

Included benchmarks

Affinity: shows thread affinity
Float Add: floating-point add operations
Float Mul: floating-point multiply operations
Float FMA: floating-point fused multiply-add operations
Memory Read: reads from each thread's local NUMA memory
Inter-thread Latency: round-trip time between each pair of host and guest threads, communicating through shared memory on the host thread's NUMA node

Units of measurement

Time duration: seconds
FLOPS: floating-point operations per second
Throughput: bytes per second
Latency: seconds
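
Since all values are reported in base units, raw numbers tend to be very large (FLOPS, bytes per second) or very small (latency in seconds). A small helper along these lines can make them easier to read; it is a hypothetical sketch that uses decimal SI prefixes (1 G = 10^9), not something roofbench provides.

```python
def scale(value: float, base_unit: str) -> str:
    """Format a raw value with a decimal SI prefix and two decimal places."""
    prefixes = [(1e12, "T"), (1e9, "G"), (1e6, "M"), (1e3, "k"),
                (1.0, ""), (1e-3, "m"), (1e-6, "u"), (1e-9, "n")]
    for factor, prefix in prefixes:
        # Pick the largest prefix that keeps the scaled value at or above 1.
        if abs(value) >= factor:
            return f"{value / factor:.2f} {prefix}{base_unit}"
    # Zero and values below 1e-9 are printed unscaled.
    return f"{value:g} {base_unit}"


print(scale(2.5e9, "FLOPS"))  # 2.50 GFLOPS
print(scale(1.2e10, "B/s"))   # 12.00 GB/s
print(scale(8.5e-8, "s"))     # 85.00 ns
```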

License

The program is free and open-source software, licensed under the MIT license.

Refer to the LICENSE file for more information.

m13253/roofbench