Prerequisites
- PyTorch
- CuPy
You will also need an NVIDIA GPU to run the code.
Day 1
Implement a JIT compiler using a Python decorator!
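As a rough illustration of the decorator idea, here is a minimal sketch in plain Python. It assumes the JIT compiles lazily on the first call and caches one compiled version per argument-type signature. The names `jit`, `compile_for`, and `cache` are hypothetical and not taken from the project, and the "compile" step is a placeholder that returns the original function instead of generating CUDA.

```python
import functools


def jit(fn):
    """Wrap fn so it is 'compiled' lazily, once per argument-type signature."""
    cache = {}  # maps a tuple of argument types -> compiled callable

    def compile_for(arg_types):
        # Hypothetical backend hook: a real JIT would translate fn into
        # device code here. This sketch simply reuses the Python function.
        return fn

    @functools.wraps(fn)
    def wrapper(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            cache[key] = compile_for(key)  # compile on first use only
        return cache[key](*args)

    wrapper.cache = cache  # exposed only so the caching can be inspected
    return wrapper


@jit
def add(a, b):
    return a + b


print(add(1, 2))      # first (int, int) call: compiles, then runs
print(add(3, 4))      # same signature: reuses the cached version
print(add(1.5, 2.5))  # new (float, float) signature: compiles again
```

The decorator keeps the call site unchanged, so a function can be switched between the plain Python path and the compiled path by adding or removing one line.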
Day 2
Implement a simple matrix exp function in CUDA!
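The sketch below is one plausible reading of a "simple" kernel, emulated on the CPU in plain Python so it runs without a GPU. It assumes exp is applied element-wise, the matrix is stored row-major, and each thread processes one whole row. `exp_kernel_row` and `launch_1d` are hypothetical names, and the loops in `launch_1d` stand in for a CUDA `<<<grid, block>>>` launch.

```python
import math


def exp_kernel_row(block_idx, block_dim, thread_idx, src, dst, rows, cols):
    # Mirrors: int row = blockIdx.x * blockDim.x + threadIdx.x;
    row = block_idx * block_dim + thread_idx
    if row >= rows:  # guard threads that fall past the last row
        return
    for col in range(cols):  # each thread walks its row serially
        i = row * cols + col  # row-major flat index
        dst[i] = math.exp(src[i])


def launch_1d(kernel, grid_dim, block_dim, *args):
    # CPU stand-in for a 1D kernel launch: run every thread in turn.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, block_dim, t, *args)


rows, cols = 3, 4
src = [0.1 * i for i in range(rows * cols)]
dst = [0.0] * (rows * cols)

block_dim = 2
grid_dim = (rows + block_dim - 1) // block_dim  # ceil-divide to cover all rows
launch_1d(exp_kernel_row, grid_dim, block_dim, src, dst, rows, cols)
```

With this mapping only `rows` threads do useful work, which leaves most of the GPU idle for matrices with few rows.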
Day 3
Make the exp kernel more efficient by using more parallelism! The performance now matches cuBLAS.
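One common way to expose more parallelism is to give every matrix element its own thread instead of having a thread loop over a row. Whether the project does exactly this is an assumption. The sketch emulates the indexing on the CPU in plain Python, and `exp_kernel_elem` and `launch_1d` are hypothetical names.

```python
import math


def exp_kernel_elem(block_idx, block_dim, thread_idx, src, dst, n):
    # Mirrors: int i = blockIdx.x * blockDim.x + threadIdx.x;
    i = block_idx * block_dim + thread_idx
    if i < n:  # guard the tail of the last block
        dst[i] = math.exp(src[i])  # exactly one element per thread


def launch_1d(kernel, grid_dim, block_dim, *args):
    # CPU stand-in for a 1D kernel launch: run every thread in turn.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, block_dim, t, *args)


rows, cols = 3, 4
n = rows * cols  # treat the matrix as a flat array
src = [0.1 * i for i in range(n)]
dst = [0.0] * n

block_dim = 8
grid_dim = (n + block_dim - 1) // block_dim  # ceil-divide to cover all elements
launch_1d(exp_kernel_elem, grid_dim, block_dim, src, dst, n)
```

The number of working threads grows from the row count to `rows * cols`, and neighbouring threads touch neighbouring elements.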
Day 4
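Simplify the kernel code by using 2D partitioning. The pitfall is mapping the rows to the x dimension: for a row-major matrix, threads that are adjacent in x would then touch addresses a full row apart, so their memory accesses are not coalesced.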
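To make the 2D mapping and its pitfall concrete, here is a CPU emulation in plain Python, assuming a row-major matrix. The x dimension covers columns and the y dimension covers rows, and the helper `x_stride` reports how far apart the addresses of two x-adjacent threads are under either mapping. All names (`exp_kernel_2d`, `launch_2d`, `x_stride`) are hypothetical.

```python
import math


def exp_kernel_2d(block_idx, block_dim, thread_idx, src, dst, rows, cols):
    # x covers columns, y covers rows (index 0 is x, index 1 is y).
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    if row < rows and col < cols:  # guard both edges of the grid
        i = row * cols + col  # row-major flat index
        dst[i] = math.exp(src[i])


def launch_2d(kernel, grid_dim, block_dim, *args):
    # CPU stand-in for a 2D kernel launch: run every thread in turn.
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    kernel((bx, by), block_dim, (tx, ty), *args)


def x_stride(cols, rows_on_x):
    # Element distance between the addresses of two threads adjacent in x.
    def index(x, y):
        row, col = (x, y) if rows_on_x else (y, x)
        return row * cols + col
    return index(1, 0) - index(0, 0)


rows, cols = 3, 5
src = [0.1 * i for i in range(rows * cols)]
dst = [0.0] * (rows * cols)

block_dim = (4, 2)
grid_dim = ((cols + block_dim[0] - 1) // block_dim[0],
            (rows + block_dim[1] - 1) // block_dim[1])
launch_2d(exp_kernel_2d, grid_dim, block_dim, src, dst, rows, cols)

good_stride = x_stride(cols, rows_on_x=False)  # columns on x: neighbours are adjacent
bad_stride = x_stride(cols, rows_on_x=True)    # rows on x: neighbours are a row apart
```

Both mappings produce the same result, so the mistake does not show up as a wrong answer, only as slower memory access on the GPU.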
Day 5
Get a first taste of fusion by creating a fused exp-div kernel!
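The sketch below illustrates the general idea of kernel fusion on the CPU in plain Python. It assumes "exp-div" means an element-wise exp followed by an element-wise division by a second array; the actual divisor used in the project is not stated, so `den` is a made-up input. The kernel and launcher names are hypothetical as well.

```python
import math


def exp_kernel(i, src, dst):
    dst[i] = math.exp(src[i])


def div_kernel(i, num, den, dst):
    dst[i] = num[i] / den[i]


def exp_div_kernel(i, src, den, dst):
    # Fused version: the exp result stays in a local value (a register on
    # the GPU) and is never written to an intermediate buffer.
    dst[i] = math.exp(src[i]) / den[i]


def launch(kernel, n, *args):
    # CPU stand-in for launching n threads, one per element.
    for i in range(n):
        kernel(i, *args)


n = 6
x = [0.5 * i for i in range(n)]
den = [float(i + 1) for i in range(n)]  # made-up divisor for illustration

# Unfused: two launches and a temporary array that is written, then re-read.
tmp = [0.0] * n
unfused = [0.0] * n
launch(exp_kernel, n, x, tmp)
launch(div_kernel, n, tmp, den, unfused)

# Fused: one launch, no temporary array.
fused = [0.0] * n
launch(exp_div_kernel, n, x, den, fused)
```

On a GPU the fused version saves one kernel launch and one full write and read of the intermediate array, which matters because element-wise kernels are usually limited by memory bandwidth.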