______ __ _______
| |.-----.----.---.-.| | __|.-----.--------.--------.
| ---|| _ | _| _ || | | || -__| | |
|______||_____|__| |___._||__|_______||_____|__|__|__|__|__|__|
Run
sudo apt install rocm-dkms rocm-libs
to install all prerequisites.
Then build:
cd src
make
- 16 GB devices (Radeon VII), double precision:
  ./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 9 300
- 32 GB devices (MI60, MI100), double precision:
  ./gemm R_64F R_64F R_64F R_64F OP_N OP_T 8640 8640 8640 8640 8640 8640 18 300
- 16 GB devices (Radeon VII), single precision:
  ./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 18 300
- 32 GB devices (MI60, MI100), single precision:
  ./gemm R_32F R_32F R_32F R_32F OP_N OP_T 8640 8640 8640 8640 8640 8640 36 300
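The batch counts in the commands above look like they were picked to nearly fill device memory. That is an inference from the numbers, not something this README states. The sketch below estimates the footprint of the A, B, and C arrays, assuming column-major storage where each matrix occupies its leading dimension times its column count. The function name `batch_footprint_bytes` is hypothetical and not part of the benchmark, and any additional allocations the benchmark makes are not counted.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the benchmark): bytes needed on one
// device for batch_count sets of A, B, and C.
// Assumption: column-major storage, each matrix takes ld * cols elements.
// elem_size is 8 for R_64F and 4 for R_32F.
std::uint64_t batch_footprint_bytes(std::uint64_t lda, std::uint64_t cols_a,
                                    std::uint64_t ldb, std::uint64_t cols_b,
                                    std::uint64_t ldc, std::uint64_t cols_c,
                                    std::uint64_t elem_size,
                                    std::uint64_t batch_count)
{
    // Elements in one (A, B, C) triple.
    const std::uint64_t elems = lda * cols_a + ldb * cols_b + ldc * cols_c;
    return elems * elem_size * batch_count;
}
```

For the 16 GB double-precision command (all dimensions 8640, 9 matrices each), this gives 16,124,313,600 bytes, about 15.0 GiB. The single-precision command with 18 matrices each gives the same figure, since the element size halves while the batch count doubles.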
./gemm PRECISION_A
       PRECISION_B
       PRECISION_C
       COMPUTE_PRECISION
       OP_A
       OP_B
       M
       N
       K
       LDA
       LDB
       LDC
       BATCH_COUNT
       TIME_SPAN      runtime duration in seconds
       [batched]      run batched GEMM
       [strided]      run strided batched GEMM
       [ex]           use the Ex API
       [hostA]        A in host memory
       [hostB]        B in host memory
       [hostC]        C in host memory
       [coherentA]    if in host memory, A is coherent (not cached)
       [coherentB]    if in host memory, B is coherent (not cached)
       [coherentC]    if in host memory, C is coherent (not cached)
       [sharedA]      one A for all devices
       [sharedB]      one B for all devices
       [zeroBeta]     set beta to zero
       [testing]      perform a basic sanity check
       [times]        print time in microseconds in addition to GFLOPS
       [hostname]     print the hostname
When TIME_SPAN is set to 0, one warmup run is done, followed by one timing run, and printing of column labels is disabled.
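The TIME_SPAN behavior can be sketched as a small timing loop. This is a minimal illustration, not the benchmark's actual code: `run_for_time_span` and its callback parameter are hypothetical names, and the callback stands in for whatever hipBLAS call is being timed. Whether the real benchmark also performs a warmup when TIME_SPAN is nonzero is not stated here, so the sketch does not assume one.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical sketch: runs `timed_run` and returns per-run times in
// microseconds.
// time_span_s == 0: one untimed warmup, then exactly one timed run.
// Otherwise: timed runs repeat until time_span_s seconds have elapsed.
std::vector<double> run_for_time_span(const std::function<void()>& timed_run,
                                      double time_span_s)
{
    using clock = std::chrono::steady_clock;
    std::vector<double> times_us;

    // Time a single invocation and record it.
    auto time_once = [&] {
        const auto t0 = clock::now();
        timed_run();
        const auto t1 = clock::now();
        times_us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    };

    if (time_span_s == 0.0) {
        timed_run();  // warmup, not recorded
        time_once();
        return times_us;
    }

    const auto start = clock::now();
    do {
        time_once();
    } while (std::chrono::duration<double>(clock::now() - start).count()
             < time_span_s);
    return times_us;
}
```

With a time span of 0 the callback is invoked twice and one time is returned, which matches the warmup-plus-one-timing-run description.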
- R_32F: float
- R_64F: double
- C_32F: float complex
- C_64F: double complex
- R_8I: 8-bit int
- R_32I: 32-bit int
- OP_N: non-transposed
- OP_T: transposed
- OP_C: conjugate-transposed
- benchmarks hipblas?gemm[Batched|StridedBatched][Ex]
- allocates BATCH_COUNT number of matrices A, B, and C
- initializes with hipRAND (random uniform, 0.0 to 1.0)
- calls hipBLAS and collects execution times using std::chrono
- sets alpha to 2.71828 and beta to 3.14159
- for hipblas?gemm[Ex] launches a sequence of calls and takes the median time
- for hipblas?gemm[Strided]Batched[Ex] launches one call and takes the overall time
- reports the corresponding GFLOPS
- repeats until TIME_SPAN is exceeded
- executes simultaneously on all devices
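The median-time and GFLOPS steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: `median` and `gemm_gflops` are hypothetical names, the operation count is the conventional 2*M*N*K for real types and 8*M*N*K for complex types (this README does not say which count the benchmark uses), and averaging the two middle values for an even number of samples is a choice made here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a non-empty list of times.
// Takes the vector by value because nth_element reorders it.
double median(std::vector<double> v)
{
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    const double upper = v[mid];
    if (v.size() % 2 == 1) {
        return upper;
    }
    // Even count: average the two middle values (an assumption).
    const double lower = *std::max_element(v.begin(), v.begin() + mid);
    return 0.5 * (lower + upper);
}

// Hypothetical helper: GFLOPS for `batch` GEMMs finished in `time_us`
// microseconds. Assumes the conventional count of 2*m*n*k operations
// per real GEMM and 8*m*n*k per complex GEMM.
double gemm_gflops(double m, double n, double k, double batch,
                   double time_us, bool is_complex)
{
    const double flops_per_gemm = (is_complex ? 8.0 : 2.0) * m * n * k;
    // flops / (time_us * 1e-6 s) / 1e9 == flops / (time_us * 1e3)
    return flops_per_gemm * batch / (time_us * 1e3);
}
```

For a non-batched run, `gemm_gflops(m, n, k, 1, median(times_us), false)` would correspond to the median-based figure. For a batched run, the single overall time and the batch count would be passed instead.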
If testing is set, a primitive sanity test is run.
Entries of A, B, and C are set to 1, and so are the factors alpha and beta.
Then, after GEMM is run, all entries of C are checked to contain k+1.
Note that performance is usually much higher when using integer initialization
than when using random data.
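The expected value k+1 follows from C = alpha * A * B^T + beta * C: with every entry and both factors equal to 1, each dot product over k terms contributes k, and the old C contributes 1. The sketch below checks this with a plain CPU reference GEMM for the OP_N, OP_T case in column-major storage. It is a stand-in for illustration only; `ref_gemm_nt` and `sanity_check` are hypothetical names and no hipBLAS call is involved.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: C = alpha * A * B^T + beta * C,
// column-major. A is m x k (lda >= m), B is n x k (ldb >= n),
// C is m x n (ldc >= m).
void ref_gemm_nt(std::size_t m, std::size_t n, std::size_t k,
                 double alpha, const std::vector<double>& A, std::size_t lda,
                 const std::vector<double>& B, std::size_t ldb,
                 double beta, std::vector<double>& C, std::size_t ldc)
{
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (std::size_t l = 0; l < k; ++l) {
                // A(i, l) * B(j, l), since op(B) = B^T
                sum += A[i + l * lda] * B[j + l * ldb];
            }
            C[i + j * ldc] = alpha * sum + beta * C[i + j * ldc];
        }
    }
}

// Fills A, B, and C with 1, runs the reference GEMM with
// alpha = beta = 1, and checks that every entry of C equals k + 1.
bool sanity_check(std::size_t m, std::size_t n, std::size_t k)
{
    std::vector<double> A(m * k, 1.0), B(n * k, 1.0), C(m * n, 1.0);
    ref_gemm_nt(m, n, k, 1.0, A, m, B, n, 1.0, C, m);
    for (double c : C) {
        if (c != static_cast<double>(k + 1)) {
            return false;
        }
    }
    return true;
}
```

Because all values are small integers, the comparison is exact in floating point for any realistic k, so no tolerance is needed in this sketch.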
Jakub Kurzak (jakurzak@amd.com)