Rewrite matrices in the CUDA implementation

Question

Opened this issue a year ago · 0 comments

The CUDA implementation is very slow because of how the CUDA matrices are implemented. Every host method that launches a global kernel makes multiple calls to cudaMalloc and cudaMemcpy, which greatly increases execution time.
The class should be reimplemented so that the device pointers are class attributes. That way the kernels can be called directly on them without having to allocate and copy memory every time.
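
A rough sketch of what this could look like, not the project's actual code: the names `DeviceMatrix`, `addKernel`, `upload`, `download` and `demo` are hypothetical, and the element type (`float`) and row-major layout are assumptions. The device pointer is allocated once in the constructor and freed in the destructor, so operations only launch kernels on buffers that are already on the GPU.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Element-wise addition; one thread per element.
__global__ void addKernel(const float* a, const float* b, float* out, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Hypothetical matrix class: the device pointer is a member that lives
// as long as the object, instead of being allocated inside every method.
class DeviceMatrix {
public:
    DeviceMatrix(size_t rows, size_t cols) : rows_(rows), cols_(cols) {
        CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&data_),
                              size() * sizeof(float)));
    }
    ~DeviceMatrix() { cudaFree(data_); }

    // The object owns device memory, so forbid accidental copies.
    DeviceMatrix(const DeviceMatrix&) = delete;
    DeviceMatrix& operator=(const DeviceMatrix&) = delete;

    size_t size() const { return rows_ * cols_; }

    // Explicit host -> device transfer, done only when the caller asks.
    void upload(const std::vector<float>& host) {
        assert(host.size() == size());
        CUDA_CHECK(cudaMemcpy(data_, host.data(), size() * sizeof(float),
                              cudaMemcpyHostToDevice));
    }

    // Explicit device -> host transfer (waits for earlier kernels on the
    // default stream to finish).
    std::vector<float> download() const {
        std::vector<float> host(size());
        CUDA_CHECK(cudaMemcpy(host.data(), data_, size() * sizeof(float),
                              cudaMemcpyDeviceToHost));
        return host;
    }

    // Launches the kernel directly on the member pointers:
    // no cudaMalloc and no cudaMemcpy on this path.
    void add(const DeviceMatrix& other, DeviceMatrix& out) const {
        assert(other.size() == size() && out.size() == size());
        const unsigned threads = 256;
        const unsigned blocks =
            static_cast<unsigned>((size() + threads - 1) / threads);
        addKernel<<<blocks, threads>>>(data_, other.data_, out.data_, size());
        CUDA_CHECK(cudaGetLastError());
    }

private:
    size_t rows_;
    size_t cols_;
    float* data_ = nullptr;
};

// Chains two additions on the device with a single download at the end.
std::vector<float> demo() {
    DeviceMatrix a(2, 2), b(2, 2), c(2, 2);
    a.upload({1.0f, 2.0f, 3.0f, 4.0f});
    b.upload({10.0f, 20.0f, 30.0f, 40.0f});
    a.add(b, c);  // c = a + b
    c.add(b, c);  // c = c + b, intermediate result never leaves the GPU
    return c.download();
}
```

Calling `demo()` from `main` performs two uploads, two kernel launches and one download in total. With the per-method allocation described above, each of the two additions would instead pay for its own allocations and copies.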