Performance enhancements (batched predictions using GEMM)
peterukk opened this issue · 1 comment
Hi,
Depending on the application (model and problem sizes), inference can be made much faster by doing it in batches: packing vector-sized inputs into a 2D array and replacing the matrix-vector multiplications with matrix-matrix multiplications delegated to a BLAS library. I have a Fortran application based on FKB, or rather its earlier incarnation neural-Fortran, where I did exactly that (I referenced neural-Fortran in my paper). It works well, and the nice thing is that it's trivial to run the code on GPUs too:
#ifdef USE_CUDA
#define sgemm cublassgemm
#endif
plus some OpenACC directives above the bias-addition and activation loops. You can find my code here. I think a similar batched output procedure for 2D arrays would be a valuable contribution to the main repo, and I am happy to work on a pull request if you agree. If so, let me know whether you'd like to keep the GPU support: I'd have to add a few things to make it more general, such as copying the input array to the device and creating the intermediate arrays for the hidden layers (in my code I get away with just two intermediate arrays and pointer swapping, because my models had the same number of neurons in all hidden layers).
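For concreteness, here is a minimal sketch of the idea for a single dense layer. The names (layer_forward_batch, w, b) are illustrative only, not the neural-Fortran/FKB API, tanh just stands in for whatever activation the layer actually uses, and device data management for the OpenACC part is not shown. The point is that the whole batch is packed as columns of a 2D array, so a single GEMM call replaces n_batch matrix-vector products, and with the preprocessor trick above that same call can be redirected to cuBLAS.

! Minimal sketch of a batched forward pass for one dense layer
! (illustrative names, not the existing library API).
subroutine layer_forward_batch(w, b, x, y)
  implicit none
  real, intent(in)  :: w(:,:)   ! weights, shape (n_out, n_in)
  real, intent(in)  :: b(:)     ! biases,  shape (n_out)
  real, intent(in)  :: x(:,:)   ! inputs,  shape (n_in,  n_batch)
  real, intent(out) :: y(:,:)   ! outputs, shape (n_out, n_batch)
  external :: sgemm
  integer :: i, j, n_out, n_in, n_batch

  n_out   = size(w, 1)
  n_in    = size(w, 2)
  n_batch = size(x, 2)

  ! One GEMM replaces n_batch matrix-vector products:
  ! y := 1.0 * w * x + 0.0 * y
  call sgemm('N', 'N', n_out, n_batch, n_in, 1.0, w, n_out, x, n_in, &
             0.0, y, n_out)

  ! Bias addition and activation; these are the loops that get the
  ! OpenACC directives when sgemm is redirected to cuBLAS.
  !$acc parallel loop collapse(2)
  do j = 1, n_batch
    do i = 1, n_out
      y(i, j) = tanh(y(i, j) + b(i))
    end do
  end do
end subroutine layer_forward_batch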
There are a few other points too:
- should DGEMM be called when the input data is in double precision? (one possible approach is sketched after this list)
- if the pointer-based activation functions of the current code are kept: last I checked, procedure-pointer targets can't be elemental functions, which is what I used
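To illustrate both bullets (the module and names below are hypothetical, not existing code): a generic interface can route single- and double-precision data to sgemm or dgemm respectively, and an elemental activation applies to the whole batched 2D array in one statement, but an elemental procedure cannot be the target of the procedure pointers the current code uses.

module batch_helpers   ! hypothetical module, illustrative only
  implicit none

  ! Precision: a generic interface lets the compiler route the call to
  ! sgemm or dgemm based on the kind of the actual arguments, so
  ! double-precision input data automatically ends up in dgemm.
  interface gemm
    subroutine sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
      character, intent(in) :: transa, transb
      integer,   intent(in) :: m, n, k, lda, ldb, ldc
      real,      intent(in) :: alpha, beta, a(lda, *), b(ldb, *)
      real,      intent(inout) :: c(ldc, *)
    end subroutine sgemm
    subroutine dgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
      character, intent(in) :: transa, transb
      integer,   intent(in) :: m, n, k, lda, ldb, ldc
      double precision, intent(in) :: alpha, beta, a(lda, *), b(ldb, *)
      double precision, intent(inout) :: c(ldc, *)
    end subroutine dgemm
  end interface gemm

contains

  ! Activations: an elemental function applies to the whole batched 2D
  ! array in one statement, e.g. y = relu(y), but elemental procedures
  ! cannot be targets of the procedure pointers used for activations
  ! in the current code.
  elemental function relu(x) result(y)
    real, intent(in) :: x
    real :: y
    y = max(0.0, x)
  end function relu

end module batch_helpers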
Has this been implemented in FKB? We are having trouble with our NNs being overly slow in Fortran, and it seems this could help.