FastKron is an efficient and fast library for Kronecker Matrix Matrix Multiplication on both Single GPU and Multi GPUs. FastKron performs orders of magnitude better than GPyTorch by avoiding transpose of the shuffle algorithm. FastKron obtains upto 85% of the maximum FLOPs of both NVIDIA Tesla V100 and NVIDIA Tesla A100. FastKron supports several datatypes: float, double, int, float.
This repository provides the source code of FastKron, Makefile, test cases, and the API.
FastKron requires generating CUDA kernels for one or more problem sizes using
src/gen_tuner_kernels.py
. For example, generating CUDA kernels for M = 1024, N = 5, P = 8 with OpX and OpF set to N.
python src/gen_tuner_kernels.py -same-factors 5 8,8 -backend cuda -opX N -opF N
Then we can build libFastKron.so
using
cd build/
cmake ..
make -j
To run tests execute
python tests/run-tests.py
FastKron provide following API functions:
fastKronInit/fastKronDestroy
initializes or destroys fastKron handle.<type>gekmm
does single GPU KronMatmul, where the type follows BLAS conventions, i.e.,s
for float,d
for double,i
for integer, andl
for long.<type>gekmmTune
performs autotuning for a given size over all compiled CUDA kernels and stores the best CUDA kernel series in its internal state.kronDistributed<type>GEMM
does multi GPU KronMatmul and follows the same convention as its single-GPU counterpart
The repository contains a Python module, PyFastKron
. The module is a Python wrapper over CUDA API functions.
It can be installed with
python setup.py install
An example CUDA program to use FastKron is written as follows:
//example.cu
#include <fastKron.h>
int main() {
//Define Problem Sizes
int N = 5;
int M = 1024;
int P = 8, Q = 8;
//Allocate inputs and output
float* x, *fs[], *y;
cudaMalloc(&x, M * (int)powf(P, N) * sizeof(float));
for (int i = 0; i < N; i++) cudaMalloc(&fs[i], P*Q * sizeof(float));
cudaMalloc(&y, M * (int)powf(Q, N) * sizeof(float));
//Initialize FastKron
fastKronHandle handle;
fastKronInit(&handle);
//Get Temporary size and allocate temporary
size_t tempSize;
gekmmSizes(handle, M, N, P, Q, nullptr, &tempSize);
float* temp;
cudaMalloc(&temp, tempSize * sizeof(float));
//Tune for best performing kernel
sgekmmTune(handle, M, N, P, Q, 0);
//Do KronMatmul using the tuned kernel
sgekmm(handle, M, N, P, Q,
x, fs, y, 1, 0, nullptr,
temp, nullptr, 0);
//Destroy FastKron
fastKronDestroy(handle);
}
Compiling using nvcc, add the include directory, and link to libFastKron.so
nvcc example.cu -Isrc/ -L build/ -lFastKron -o example
Run the example
./example