This is a tiny C project for exploring the caching behavior of multicore systems. I wrote it after reading about flat combining data structures, in order to determine when single-threaded execution outperforms multi-threaded execution at a low level on *nix systems.
By default, running make all produces one program, inc, which measures only one thing: how long it takes to increment integers. For example, the following invocation:
env OMP_NUM_THREADS=$THREADS ./inc $TIME $ITERS $INTS $STRIDE $RAND $ATOMIC
with variables set to:
THREADS=4
TIME=0.25
ITERS=15
INTS=2
STRIDE=3
RAND=0
ATOMIC=1
uses 4 OpenMP threads to atomically increment 2 integers 15 times each, where the integers are spaced 3 integer-lengths apart and accessed in linear order. This is repeated until the total execution time exceeds 0.25 seconds, after which the average number of nanoseconds per integer increment (minus loop overhead) is reported.
The memory layout of the above scenario looks like this:
Offset          Contents
0x000 - 0x003   Data (first integer)
0x004 - 0x00b   Padding
0x00c - 0x00f   Data (second integer)
0x010 - 0x017   Padding

Each stride spans 12 bytes: one 4-byte integer followed by 8 bytes of padding.
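For reference, the measured kernel is conceptually something like the sketch below. This is illustrative only: the function and parameter names are not taken from inc.c, and the real code also times the loop, subtracts loop overhead, and supports randomized access order.

/* Illustrative sketch of the measured kernel; not the actual inc.c code. */
#include <stddef.h>

void run(int *data, int nints, int stride, int iters)
{
    #pragma omp parallel                    /* every thread runs the whole loop */
    for (int it = 0; it < iters; it++) {
        for (int i = 0; i < nints; i++) {
            #pragma omp atomic              /* only present when ATOMIC=1 */
            data[(size_t)i * stride] += 1;
        }
    }
}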
The theory is that incrementing an integer is close to free on modern (i.e. superscalar) systems, so most of the time will be spent on memory access. This can be seen by running the following (or make check_hier):
for i in 2 4 8 `seq 16 16 512` `seq 768 256 8196`; do
env OMP_NUM_THREADS=1 ./inc 0.1 100 $i 1024 0 0
done
and verifying that the reported latency correlates with the cache hierarchy.
Currently, inc.c relies on OpenMP to provide multithreading and atomic increments. While this is fairly portable and easy to code, it's not foolproof. With the x86 ISA, this C code:
data[idx] += 1;
compiles (with gcc -S -masm=intel -O1) to something like:
add DWORD PTR [rax+rdx*4], 1
Adding the OpenMP atomic pragma:
#pragma omp atomic
data[idx] += 1;
produces minimal overhead, namely the lock prefix:
lock add DWORD PTR [rax+rdx*4], 1
This produces the expected results, and allows for reasonable comparisons between single-threaded and multi-threaded runs.
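As a point of comparison (not something inc.c currently uses), the same effect can be had directly with GCC's __atomic builtins; when the return value is discarded, gcc typically lowers the fetch-and-add to the same lock add:

/* Alternative formulation using GCC's __atomic builtins (illustrative;
 * inc.c itself relies on the OpenMP pragma instead). */
#include <stddef.h>

static void atomic_inc(int *data, size_t idx)
{
    /* Discarding the result lets gcc emit "lock add" on x86. */
    __atomic_fetch_add(&data[idx], 1, __ATOMIC_SEQ_CST);
}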
Switching to the ARM ISA, the default assembly from gcc for the non-atomic case is:
ldr r2, [r3, r5]
add r2, r2, #1
str r2, [r3, r5]
Adding the atomic pragma results in:
bl GOMP_atomic_start
ldr r3, [r5, r7]
add r3, r3, #1
str r3, [r5, r7]
bl GOMP_atomic_end
which adds two function calls right in the hot path. This is extremely unfair to the atomic case, and does not produce reasonable results. This can be (mostly) fixed by letting GCC know that it can target modern ARM chips (gcc -march=armv7-a), thus generating:
dmb sy
.LSYT354:
ldrex r3, [r5]
add r3, r3, r9
strex r2, r3, [r5]
teq r2, #0
bne .LSYT354
.LSYB354:
dmb sy
which produces mostly decent results.
As a quick aside, this is an excellent example of RISC vs CISC architectures: the CISC (x86) ISA adds complex synchronization functionality with a single byte prefix, whereas the RISC (ARM) ISA requires multiple, explicit instructions (including a retry loop!) to accomplish approximately the same thing.
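For reference, the ldrex/strex sequence above does roughly what the following compare-and-swap retry loop does in C. This is an illustrative sketch, not code from inc.c:

/* Rough C equivalent of the ldrex/strex retry loop: load, add, and
 * retry until no other core has modified the value in the meantime. */
static void cas_inc(int *p)
{
    int old = __atomic_load_n(p, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(p, &old, old + 1,
                                        1 /* weak */,
                                        __ATOMIC_SEQ_CST,
                                        __ATOMIC_RELAXED))
        ;  /* on failure, old is refreshed with the current value */
}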
CFLAGS can be edited in the Makefile, or passed in from the environment:
CFLAGS='-march=armv7-a' make
To quickly check things, run make test. All the input parameters are echoed to stderr, and the results (with headers) are printed to stdout. If everything looks good, running make tinyreport (and waiting five minutes) will produce a CSV file with more extensive results. make stdreport will take much longer and produce more robust results; additional make targets can be added as necessary.
To get a visual overview of the results, make sure you have Pandas and Matplotlib installed and run ./plotreport.py stdreport.