A CUDA approach for computing the product of a transposed matrix with the original matrix (Aᵀ⋅A).
This repository contains three different implementations for computing Aᵀ⋅A using CUDA and the cuBLAS library:
- cuBLAS implementation (cublasDgemm)
- Simple implementation
- Simple implementation with shared memory
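
To give a rough idea of what the simple variant could look like, here is a minimal sketch of a naive kernel in which each thread computes one element of Aᵀ⋅A. It is not taken from this repository: the names `ataNaive` and `computeAtA`, the row-major layout, and the 16x16 block size are all assumptions made for illustration, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Naive kernel (illustrative, not the repository's code):
// one thread computes one element C[i][j] of C = A^T * A.
// A is rows x cols and C is cols x cols, both assumed row-major.
__global__ void ataNaive(const double* A, double* C, int rows, int cols) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index in C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index in C
    if (i >= cols || j >= cols) return;             // guard partial blocks

    // C[i][j] is the dot product of column i and column j of A.
    double sum = 0.0;
    for (int k = 0; k < rows; ++k) {
        sum += A[k * cols + i] * A[k * cols + j];
    }
    C[i * cols + j] = sum;
}

// Hypothetical host wrapper: copies A to the device, launches the kernel,
// and returns the cols x cols result. Error checking is left out.
std::vector<double> computeAtA(const std::vector<double>& A, int rows, int cols) {
    double* dA = nullptr;
    double* dC = nullptr;
    size_t bytesA = sizeof(double) * rows * cols;
    size_t bytesC = sizeof(double) * cols * cols;

    cudaMalloc(&dA, bytesA);
    cudaMalloc(&dC, bytesC);
    cudaMemcpy(dA, A.data(), bytesA, cudaMemcpyHostToDevice);

    // Assumed launch configuration: 16x16 threads per block.
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (cols + block.y - 1) / block.y);
    ataNaive<<<grid, block>>>(dA, dC, rows, cols);

    // The blocking copy below also waits for the kernel to finish.
    std::vector<double> C(cols * cols);
    cudaMemcpy(C.data(), dC, bytesC, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dC);
    return C;
}
```

For example, calling `computeAtA` on the 3x2 matrix with rows (1, 2), (3, 4), (5, 6) should yield the 2x2 result with rows (35, 44), (44, 56). A shared-memory variant would typically stage tiles of A in `__shared__` arrays so that each value is read from global memory fewer times. Because Aᵀ⋅A is symmetric, an implementation could also compute only one triangle and mirror it.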