Fooling around with flops/clock in the famous SGEMM - what could be more fun? GEMM generally does C += A * B
where A, B and C are large-ish dense matrices of single (SGEMM) or double precision (DGEMM) floats.
The low-tech bash script build_sgemm.sh
will try to build the test for a recognized host architectures - substitute the compiler for one of your choice. Macros of interest, passed with -D
on the command line:
ALT
- implementation alternatives- -1 - scalar version
- 0 - 16-element-wide version suitable for autovectorizers
- 1 - 2x16-element-wide SSE2 (x86/amd64) version
- 2 - 64-element-wide AVX256 (x86/amd64) version
- 3 - 2x32-element-wide AVX256 (x86/amd64) version
- 4 - 16-element-wide ASIMD2 (aarch64) version
- 5 - 32-element-wide ASIMD2 (aarch64) version
- 6 - 2x16-element-wide ASIMD2 (aarch64) version
- 7 - 2x32-element-wide ASIMD2 (aarch64) version
- 8 - 2x16-element-wide MSA (mips32/mips64) version
- 9 - 2x32-element-wide AVX512 (x86/amd64) version
- 10 - 2x64-element-wide AVX512 (x86/amd64) version
- 11 - 64-element-wide SVE512 (aarch64) version
- 12 - 2x64-element-wide SVE512 (aarch64) version
- 13 - 4x64-element-wide SVE512 (aarch64) version
PREFETCH
- distance, in floats, to prefetch in the innermost loop (0 for no prefetch; unused in the scalar version)MATX_SIZE
- dimension of the square matrices A, B & CREP_EXP
- exponent of the number of repetitions of the test, ie. 1eEXPPRINT_MATX
- print out C on the standard output (for debugging)
To tell what prefetch works best on a given CPU and matrix dimension, use something along the following (pick ALT wisely):
for i in {0..10} ; do ./build_sgemm.sh -DALT=1 -DPREFETCH=`echo "512 + 512 * $i" | bc` -DMATX_SIZE=512 -DREP_EXP=1 ; ./sgemm ; done
Best results measured in SP flops/clock by the formula:
MATX_SIZE^3 * 2 * 10^REP_EXP / (CPU_freq * duration)
CPU (single thread only) | width of SIMD ALU | RAM GB/s | LLC visible per core | 64x64 | 512x512 | remarks 1 |
---|---|---|---|---|---|---|
AMD C60 (Bobcat) | 2-way | 8.53 | 512 KB | 1.94 | 1.47 | g++ 4.8, ALT = 1, PREFETCH = 2560, SSE2 intrinsics, 1.33GHz |
Intel Core2 T5600 | 4-way | 5.33 | 2 MB | 3.31 | 2.82 | clang++ 3.4, ALT = 1, PREFETCH = 4096, SSE2 intrinsics, 1.83GHz |
Intel Core2 P8600 | 4-way | 8.53 | 3 MB 2 | 4.86 | 4.14 | apple clang++ 8.1, ALT = 1, PREFETCH = 2048, SSE2 intrinsics, 2.40GHz |
Intel E5-2687W (SNB) | 8-way | 25.6 | 20 MB 2 | 13.79 | 10.17 | clang++ 3.6, ALT = 3, PREFETCH = 3584, AVX256 intrinsics, 3.1GHz |
Intel E5-2687W (SNB) | 8-way | 25.6 | 20 MB 2 | 14.27 | 10.25 | g++ 4.8, ALT = 3, PREFETCH = 3584, AVX256 intrinsics, 3.1GHz |
Intel E3-1270v2 (IVB) | 8-way | 25.6 | 8 MB 2 | 13.40 | 11.05 | clang++ 3.6, ALT = 3, PREFETCH = 3072, AVX256 intrinsics, 1.6GHz |
Intel E3-1270v2 (IVB) | 8-way | 25.6 | 8 MB 2 | 14.01 | 11.22 | g++ 4.8, ALT = 3, PREFETCH = 3072, AVX256 intrinsics, 1.6GHz |
Intel i7-4770 (HSW) | 8-way | 25.6 | 8 MB 2 | 22.72 | 11.65 | g++ 5.1, ALT = 3, PREFETCH = 2560, AVX256+FMA3 intrinsics, 3.9GHz |
Xeon Silver 4208 (CSL) | 8-way | 42.7 | 11 MB 2 | 16.76 | 13.10 | clang++ 7.0, ALT = 3, PREFETCH = 4608, AVX256+FMA3 intrinsics, 3.0GHz |
Xeon Silver 4208 (CSL) | 16-way | 42.7 | 11 MB 2 | 30.28 | 16.50 | clang++ 7.0, ALT = 9, PREFETCH = 2048, AVX512 intrinsics, 2.0GHz |
Xeon W-2155 (SKL) | 16-way | 75.5 | 13.75 MB 2 | 50.00 | 16.62 | clang++ 10.0, ALT = 10, PREFETCH = 2560, AVX512 intrinsics, 4.0GHz |
AMD Ryzen 1700X (Zen) | 4-way | 37.5 | 16 MB 2 | 14.15 | 10.22 | clang++ 3.8, ALT = 3, PREFETCH = 3072, AVX256 intrinsics, 3.4GHz |
RK3368 (Cortex-A53) | 2-way | 6.4 | 512 KB 3 | 3.12 | 1.39 | clang++ 3.6, ALT = 7, PREFETCH = 1536, ASIMD2 intrinsics, 1.51GHz |
RK3399 (Cortex-A72) | 4-way | 12.8 | 1 MB | 6.81 | 2.01 | clang++ 7.0, ALT = 7, PREFETCH = 2560, ASIMD2 intrinsics, 1.8GHz 4 |
Allwinner A64 (Cortex-A53) | 2-way | 4.42 | 512 KB | 3.18 | 1.38 | clang++ 3.6, ALT = 6, PREFETCH = 2560, ASIMD2 intrinsics, 1.152GHz 5 |
MT8163A (Cortex-A53) | 2-way | 6.4 | 512 KB | 3.09 | 1.65 | clang++ 3.6, ALT = 7, PREFETCH = 1536, ASIMD2 intrinsics, 1.5GHz |
MT8173C (Cortex-A53) A32 | 2-way | 12.8 | 512 KB | 1.62 | 1.01 | clang++ 6.0, ALT = 6, PREFETCH = 2560, ASIMD intrinsics, 1.7GHz 6 |
MT8173C (Cortex-A53) | 2-way | 12.8 | 512 KB | 2.68 | 1.44 | clang++ 6.0, ALT = 6, PREFETCH = 2560, ASIMD2 intrinsics, 1.7GHz 5 |
MT8173C (Cortex-A72) A32 | 4-way | 12.8 | 1 MB | 3.23 | 1.81 | clang++ 6.0, ALT = 7, PREFETCH = 2560, ASIMD intrinsics, 2.1GHz 6 |
MT8173C (Cortex-A72) | 4-way | 12.8 | 1 MB | 6.82 | 2.30 | clang++ 6.0, ALT = 7, PREFETCH = 2560, ASIMD2 intrinsics, 2.1GHz 4 |
Marvell A8040 (Cortex-A72) | 4-way | 12.8 | 1 MB | 6.52 | 2.91 | clang++ 3.5, ALT = 7, PREFETCH = 1536, ASIMD2 intrinsics, 1.3GHz 4 |
NXP LX2160A (Cortex-A72) | 4-way | 19.2 | 1 MB | 6.70 | 4.08 | clang++ 6.0, ALT = 7, PREFETCH = 1536, ASIMD2 intrinsics, 2.0GHz 4 |
AWS Graviton (Cortex-A72) | 4-way | 19.2 | 2 MB | 6.81 | 4.12 | clang++ 6.0, ALT = 7, PREFETCH = 1024, ASIMD2 intrinsics, 2.28GHz 4 7 |
Amlogic S922X (Cortex-A53) | 2-way | 10.56 | 256 KB | 2.65 | 1.47 | clang++ 6.0, ALT = 6, PREFETCH = 2560, ASIMD2 intrinsics, 1.896GHz 5 |
Amlogic S922X (Cortex-A73) | 4-way | 10.56 | 1 MB | 5.20 | 2.21 | clang++ 6.0, ALT = 6, PREFETCH = 2048, ASIMD2 intrinsics, 1.8GHz 5 |
Snapdragon 835 (Cortex-A73) | 4-way | 14.93 | 2 MB | 5.93 | 3.07 | clang++ 9.0, ALT = 7, PREFETCH = 2048, ASIMD2 intrinsics, 2.55GHz |
Snapdragon 835 (Cortex-A73) | 4-way | 14.93 | 2 MB | 5.43 | 3.92 | clang++ 6.0, ALT = 6, PREFETCH = 2048, ASIMD2 intrinsics, 2.55GHz 5 |
Snapdragon SQ1 (Cortex-A76) | 4-way | 34.13 | 4 MB 2 | 15.29 | 6.79 | clang++ 9.0, ALT = 7, PREFETCH = 1536, ASIMD2 intrinsics, 3.0GHz |
NVIDIA armv8.2 (Carmel) | 4-way | 51.2 | 4 MB 2 | 13.49 | 6.90 | clang++ 9.0, ALT = 7, PREFETCH = 3584, ASIMD2 intrinsics, 1.91GHz |
Fujitsu armv8.2 (A64fx) | 16-way | 256 | 8 MB 2 | 13.86 | 13.33 | g++ 10.2, ALT = 11, PREFETCH = 4096, SVE intrinsics, 2.2GHz 4 |
Fujitsu armv8.2 (A64fx) | 16-way | 256 | 8 MB 2 | 27.89 | 23.26 | g++ 10.2, ALT = 12, PREFETCH = 4096, SVE intrinsics, 2.2GHz 4 |
Fujitsu armv8.2 (A64fx) | 16-way | 256 | 8 MB 2 | 13.32 | 13.46 | armclang++ 20.3, ALT = 11, PREFETCH = 3584, SVE intrinsics, 2.2GHz 4 |
Fujitsu armv8.2 (A64fx) | 16-way | 256 | 8 MB 2 | 28.20 | 26.28 | armclang++ 20.3, ALT = 12, PREFETCH = 4608, SVE intrinsics, 2.2GHz 4 |
Fujitsu armv8.2 (A64fx) | 16-way | 256 | 8 MB 2 | 39.64 | 36.17 | armclang++ 20.3, ALT = 13, PREFETCH = 3072, SVE intrinsics, 2.2GHz |
Apple armv8.4 (Firestorm) | 4-way | 58 | 12 MB 2 | 30.97 | 22.32 | apple clang++ 12.0, ALT = 7, PREFETCH = 3072, ASIMD2 intrinsics, 3.2GHz 8 |
Baikal-T1 (MIPS P5600) | 4-way | 6.4 | 1 MB | 3.85 | 2.00 | g++ 7.3, ALT = 8, PREFETCH = 4096, MSA intrinsics, 1.2GHz 9 |
Baikal-T1 (MIPS P5600) | 4-way | 6.4 | 1 MB | 3.74 | 2.09 | g++ 7.3, ALT = 8, PREFETCH = 4096, MSA intrinsics, 1.2GHz 9 10 |
Footnotes
-
Prefetch applies only to 512x512 and is tuned for the given core clock; 64x64 is not prefetched. ↩
-
The entirety of 512x512 matrices fit in LLC; LLC runs in the clock domain of the cores on SNB & IVB, but in its own clock domain on HSW. ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8 ↩9 ↩10 ↩11 ↩12 ↩13 ↩14 ↩15 ↩16 ↩17 ↩18
-
Amount of shared L2 in the 'big' cluster. ↩
-
Non-native compiler tuning -mtune=cortex-a57. ↩ ↩2 ↩3 ↩4 ↩5 ↩6 ↩7 ↩8 ↩9
-
Small dataset (64x64) uses ALT=7, big dataset (512x512) uses ALT=6. ↩ ↩2 ↩3 ↩4 ↩5
-
Core part of AWS EC2 instance. ↩
-
Prefetch makes little difference. ↩
-
Large variance in the 512x512 times -- best result listed. ↩ ↩2
-
Non-native compiler tuning -mtune=mips32r5. ↩