This repo evaluates different matrix multiplication implementations given two large square matrices (2000-by-2000 in the following example):
Implementation | Long description |
---|---|
Naive | Most obvious implementation |
Transposed | Transposing the second matrix for cache efficiency |
sdot w/o hints | Replacing the inner loop with BLAS sdot() |
sdot with hints | sdot() with a bit unrolled loop |
SSE sdot | vectorized sdot() with explicit SSE instructions |
SSE+tiling sdot | SSE sdot() with loop tiling |
OpenBLAS sdot | sdot() provided by OpenBLAS |
OpenBLAS sgemm | sgemm() provided by OpenBLAS |
To compile the evaluation program:
make CBLAS=/path/to/cblas/prefix
or omit the CBLAS
setting you don't have it. After compilation, use
./matmul -h
to see the available options. Here is the result on my machines:
Implementation | -a | Linux,-n2000 | Linux,-n4000 | Linux/icc,-n4000 | Mac,-n2000 |
---|---|---|---|---|---|
Naive | 0 | 7.53 sec | 188.85 sec | 173.76 sec | 77.45 sec |
Transposed | 1 | 6.66 sec | 55.48 sec | 21.04 sec | 9.73 sec |
sdot w/o hints | 4 | 6.66 sec | 55.04 sec | 21.35 sec | 9.70 sec |
sdot with hints | 3 | 2.41 sec | 29.47 sec | 21.69 sec | 2.92 sec |
SSE sdot | 2 | 1.36 sec | 21.79 sec | 22.18 sec | 2.92 sec |
SSE+tiling sdot | 7 | 1.11 sec | 10.84 sec | 10.97 sec | 1.90 sec |
OpenBLAS sdot | 5 | 2.69 sec | 28.87 sec | 5.61 sec | |
OpenBLAS sgemm | 6 | 0.63 sec | 4.91 sec | 0.86 sec | |
uBLAS | 7.43 sec | 165.74 sec | |||
Eigen | 0.61 sec | 4.76 sec |
The machine configurations are as follows:
Machine | CPU | OS | Compiler |
---|---|---|---|
Linux | 2.6 GHz Xeon E5-2697 | CentOS 6 | gcc-4.4.7/icc-15.0.3 |
Mac | 1.7 GHz Intel Core i5-2557M | OS X 10.9.5 | clang-600.0.57/LLVM-3.5svn |
On both machines, OpenBLAS-0.2.18 is compiled with the following options (no AVX or multithreading):
TARGET=CORE2
BINARY=64
USE_THREAD=0
NO_SHARED=1
ONLY_CBLAS=1
NO_LAPACK=1
NO_LAPACKE=1