d-li14/involution

Fast and generic implementation using OpenMP and CUDA

shikishima-TasakiLab opened this issue · 3 comments

I have implemented a module using OpenMP and CUDA that runs faster than your CuPy implementation while maintaining its memory efficiency.

shikishima-TasakiLab/Involution-PyTorch
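For context, involution generates a kernel per spatial location that is shared across the channels within a group, rather than a single kernel shared across all locations as in convolution. Below is a minimal NumPy sketch of that forward pass for intuition only; the function name, argument layout, and loop structure are illustrative assumptions, not the API of the linked repo or of the CuPy implementation.

```python
import numpy as np

def involution2d(x, kernel):
    """Illustrative involution forward pass (not the repo's API).

    x:      input feature map, shape (C, H, W)
    kernel: per-pixel kernels, shape (G, K, K, H, W) -- one K x K kernel
            per group per spatial location, shared within each group
    """
    C, H, W = x.shape
    G, K, _, _, _ = kernel.shape
    pad = K // 2
    # Zero-pad spatial dims so the output keeps the input resolution.
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    cg = C // G  # channels per group
    for g in range(G):
        for u in range(K):
            for v in range(K):
                # Shifted view of the padded input for kernel tap (u, v).
                patch = xp[g * cg:(g + 1) * cg, u:u + H, v:v + W]
                # kernel[g, u, v] has shape (H, W): a distinct weight
                # at every spatial position, broadcast over the group's channels.
                out[g * cg:(g + 1) * cg] += kernel[g, u, v] * patch
    return out
```

A fast implementation avoids these Python loops by unfolding the input and reducing over the kernel taps in parallel, which is where the OpenMP/CUDA kernels come in.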

It also supports TorchScript and 16-bit float.

shikishima-TasakiLab/Involution-PyTorch#1

Great work! It will help a lot in practice!
As mentioned in the README, would you please open a PR to contribute it to this repo? Just to be on the safe side, I will run some experiments to double-check the reimplementation's correctness before merging it into the main branch. Thanks.

I have opened a PR.
I did not resolve the conflicting parts of the README, so please add the module descriptions accordingly.

OK, I will verify and merge it as soon as I can.