/KRONMULT

Primary LanguageC++

Code to perform 6-dimensional batched kronecker product on GPU and CPU (using OpenMP).

Y(:,k) += kron(A1(:,:,k), ..., A6(:,:,k) ) * X(:,k), where k=1:batchCount

The code assumes each matrix A1, ..., A6 are all square and same shape of n by n.

Note there can be overlap in the output vectors Y(:,k). Atomic add updates will be performed.

Each vector X(:,k) or Y(:,k) is conceptually of shape n^6 by 1 but during the computation can be reshaped as n^5 by n or n^4 by n^2 or n^3 by n^3 or n^2 by n^4 or n by n^5.

Note that
Y = kron(A1,...,A6) * X can be evaluated as step 1: W = reshape( X, [n^5,n]) * transpose(A1) step 2: Y = kron( A2, ..., A6) * W, which can be viewed as "tail recursion"

At the lowest level of recursion Y = kron(A1) * X is simply implemented as matrix multiply Y = A1 * X

This implementation evaluates each kronecker product in a separate thread block on GPU. Instead of building a long batch list to call batched GEMM, the matrix-matrix multiplication is evaluated as calls to device functions

kgemm_nn() to evaluate C = alpha * A * B + beta * C or kgemm_nt() to evaluate C = alpha * A * transpose(B) + beta * C

For the special case of (beta == 1), atomicAdd update is used.

Note that the GEMM operations will be performed on very slender rectangular matrices. Therefore, the computations will not be dominated by floating point operations but by data movement, especially when n is small.


To compile the code for CPU (1) mkdir build && cd build (2) cmake ../ (3) make

To compile the code for Nvidia GPU (1) mkdir build && cd build (2) cmake ../ -DUSE_GPU=1 (3) make

To run the tester for kgemm_nn_batched, perform ./test_kgemm_nn_batched

To run the tester for kgemm_nt_batched, perform ./test_kgemm_nt_batched

To run the tester for kronmult6_batched, perform ./test_kronmult6_batched