About
I wrote these benchmarks for a presentation on "Performance Tips, Tricks, and Gotchas". They contain benchmarks to compare several ways of doing the same thing in C++ that are subtly different on the surface but may differ significantly in terms of performance. Writing these was an interesting learning opportunity for me, because I learned how to write these benchmarks in the process of doing it, and though I already knew that in principal there were performance differences between these things, I'd never actually taken the time to measure them.
Benchmarks include measurements for:
- Function call overhead: Virtual member function vs. non-virtual member function vs. lambda function vs. std::function
- Effects of data locality/cache misses
- False sharing between threads
- Using mutexes vs. atomics
This is a work in progress and there may be mistakes. There are also a few TODOs left in benchmarks.cpp that are worth paying attention to. I'll clean this up more in the following weeks.
Caveats
Never take benchmarks at face value and assume the results will always be the same. There's a lot of context that goes with each of these, which means that you might not want to always do the thing that appears fastest here. For example, in different situations the compiler is able to perform different optimizations, so something that looks "free" in the benchmark might turn out not to be free in your code - or something that seems faster in a microbenchmark might adversely impact performance in another way. The point of this is to understand different factors that might affect performance so that you can watch out for them and consider them - not to tell you what's right or what's wrong to use in every situation.
How to Install and Run
# Install conan. Used to fetch google benchmark.
sudo apt-get install python3-venv
python3 -m venv pyenv
source pyenv/bin/activate
pip install conan
# Configure conan.
conan profile new default --detect
conan profile update settings.compiler.libcxx=libstdc++11 default
mkdir build && cd build
# This will download google benchmark
conan install ..
# Configure cmake
cmake .. -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release
# Build benchmark
cmake --build .
# Run benchmark
./bin/benchmarks
Output on my machine:
Running ./build/bin/benchmarks
Run on (16 X 3396.7 MHz CPU s)
CPU Caches:
L1 Data 32K (x8)
L1 Instruction 32K (x8)
L2 Unified 1024K (x8)
L3 Unified 25344K (x1)
Load Average: 0.00, 0.11, 0.27
----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------
BM_virtualFunctionCallsThroughPointerToParent 2.51 ns 2.51 ns 294239104
BM_virtualFunctionCallsThroughPointerToChild 1.61 ns 1.61 ns 433945778
BM_virtualFunctionCallsThroughInstanceOfChild 0.295 ns 0.295 ns 1000000000
BM_nonVirtualNonInlineFunctionCall 3.24 ns 3.24 ns 215936970
BM_inlineFunctionCall 0.295 ns 0.295 ns 1000000000
BM_noFunctionCall 0.295 ns 0.295 ns 1000000000
BM_stdFunctionCall 1.77 ns 1.77 ns 395824197
BM_lambdaFunctionCall 0.295 ns 0.295 ns 1000000000
BM_stdFunctionPassedAsParameterFunctionCall 2.06 ns 2.06 ns 339063066
BM_lambdaPassedAsParameterFunctionCall 0.295 ns 0.295 ns 1000000000
BM_sequentialListAccess 1367 ns 1367 ns 440074
BM_sequentialArrayAccess 148 ns 148 ns 4704685
BM_sequentialArrayAccessSmallerThanL1 47.8 ns 47.8 ns 14637763
BM_randomArrayAccessSmallerThanL1 99.9 ns 99.9 ns 7034840
BM_sequentialArrayAccessBiggerThanL1 158070 ns 158063 ns 4429
BM_randomArrayAccessBiggerThanL1 727313 ns 727301 ns 839
BM_falseSharing/manual_time 2841579 ns 41720 ns 246
BM_noFalseSharing/manual_time 2144634 ns 39660 ns 326
BM_useMutex/manual_time 116323172 ns 50073 ns 6
BM_useMutexNoContention/manual_time 15982798 ns 36639 ns 44
BM_useAtomic/manual_time 28326920 ns 39443 ns 25