Thanks and self-recommendation
Opened this issue · 4 comments
Thank you for the code that taught me how to use oneAPI. I wrote some code myself to implement matrix multiplication, and it is faster than MKL.
My project link: https://github.com/Perry961002/oneAPINote
@Perry961002 it's great to see the work you have done with SYCL on the GPU.
I saw that you have written several different versions of GEMM, and it's amazing :)
I will check and run your code soon and give you further suggestions. Meanwhile, you can try to calculate the peak GFLOPS of this GPU and how much of that peak your code achieves.
Good job again!
I mainly used sub-matrix partitioning, and I temporarily changed the element layout of matrices A and B to speed up memory access.
It's a common approach. You can also try to hide the layout change by using shared local memory.
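To illustrate the idea discussed here, the following is a host-side C++ sketch of sub-matrix partitioning where the layout change happens while each tile is copied into a small scratch buffer. It is not the code from the linked project and it is not a SYCL kernel: the names `gemm_naive` and `gemm_tiled` and the default tile size are assumptions made for illustration. In a SYCL kernel, the two scratch buffers would roughly correspond to work-group shared local memory, so the transposition cost is folded into a copy that has to happen anyway.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference triple loop, row-major: C (m x n) = A (m x k) * B (k x n).
std::vector<float> gemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                              std::size_t m, std::size_t n, std::size_t k) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
    return C;
}

// Blocked GEMM. Each tile of A and B is copied into a contiguous scratch
// buffer, and the B tile is stored transposed during that copy, so the
// innermost loop reads both tiles with unit stride.
std::vector<float> gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                              std::size_t m, std::size_t n, std::size_t k,
                              std::size_t tile = 32) {
    std::vector<float> C(m * n, 0.0f);
    std::vector<float> a_tile(tile * tile), b_tile(tile * tile);
    for (std::size_t i0 = 0; i0 < m; i0 += tile) {
        const std::size_t ib = std::min(tile, m - i0);  // rows in this tile
        for (std::size_t p0 = 0; p0 < k; p0 += tile) {
            const std::size_t pb = std::min(tile, k - p0);  // depth of this tile
            // Stage the A tile (ib x pb) in row-major order.
            for (std::size_t i = 0; i < ib; ++i)
                for (std::size_t p = 0; p < pb; ++p)
                    a_tile[i * tile + p] = A[(i0 + i) * k + (p0 + p)];
            for (std::size_t j0 = 0; j0 < n; j0 += tile) {
                const std::size_t jb = std::min(tile, n - j0);  // columns in this tile
                // Stage the B tile (pb x jb) transposed: the layout change
                // is done here, as part of the copy.
                for (std::size_t p = 0; p < pb; ++p)
                    for (std::size_t j = 0; j < jb; ++j)
                        b_tile[j * tile + p] = B[(p0 + p) * n + (j0 + j)];
                // Multiply the two staged tiles and accumulate into C.
                for (std::size_t i = 0; i < ib; ++i)
                    for (std::size_t j = 0; j < jb; ++j) {
                        float acc = 0.0f;
                        for (std::size_t p = 0; p < pb; ++p)
                            acc += a_tile[i * tile + p] * b_tile[j * tile + p];
                        C[(i0 + i) * n + (j0 + j)] += acc;
                    }
            }
        }
    }
    return C;
}
```

Calling `gemm_tiled(A, B, m, n, k, 16)` on row-major inputs and comparing against `gemm_naive` is a quick way to check a tiling scheme before porting it to a device kernel. The tile size that performs best depends on the hardware and has to be found by measurement.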
Yes, that's the method I used.