nanoPyC

Primary language: Python

Prerequisites

  • PyTorch
  • CuPy

You will also need an NVIDIA GPU to run the code.

Day 1

Implement a JIT compiler using a Python decorator!
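
To make the decorator idea concrete, here is a minimal sketch of how a JIT-style decorator can defer work until the first call and cache the result per argument-type signature. The names `jit` and `fake_compile` are hypothetical and are not taken from this repository; `fake_compile` only stands in for a real backend (for example, one that generates CUDA source and compiles it), so the sketch runs without a GPU.

```python
import functools


def fake_compile(fn, arg_types):
    """Hypothetical stand-in for a real compiler backend.

    A real JIT would translate `fn` into device code here. This sketch
    only counts how often "compilation" happens and returns `fn` unchanged.
    """
    fake_compile.calls += 1
    return fn


fake_compile.calls = 0


def jit(fn):
    """Compile lazily on first call, then reuse the cached result."""
    cache = {}  # maps a tuple of argument types to a compiled callable

    @functools.wraps(fn)
    def wrapper(*args):
        key = tuple(type(a) for a in args)
        if key not in cache:
            # First call with this type signature: compile and remember.
            cache[key] = fake_compile(fn, key)
        return cache[key](*args)

    return wrapper


@jit
def add(a, b):
    return a + b


r1 = add(1, 2)      # compiles for (int, int)
r2 = add(3, 4)      # reuses the cached (int, int) version
r3 = add(1.0, 2.0)  # compiles again for (float, float)
```

Keying the cache on argument types is one common design choice; a real implementation might also key on array shapes or dtypes.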

Day 2

Implement a simple matrix exp function as a CUDA kernel!
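
As a rough illustration, the sketch below emulates on the CPU, in plain Python, one common shape for such a kernel: one thread per element, with a bounds check for the last, partially filled block. It assumes that "exp" means an element-wise exponential over a flattened matrix; the function names `exp_kernel` and `launch` are hypothetical, and the nested loops only imitate what the GPU would run in parallel.

```python
import math


def exp_kernel(block_idx, thread_idx, block_dim, x, out, n):
    """Body of one emulated CUDA thread (element-wise exp, an assumption)."""
    # Same index arithmetic as blockIdx.x * blockDim.x + threadIdx.x in CUDA.
    i = block_idx * block_dim + thread_idx
    if i < n:  # guard: the last block may extend past the end of the data
        out[i] = math.exp(x[i])


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a 1D kernel launch."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, *args)


x = [0.0, 1.0, 2.0, 3.0, 4.0]
out = [0.0] * len(x)

block_dim = 4
# Round up so every element is covered by some thread.
grid_dim = (len(x) + block_dim - 1) // block_dim

launch(exp_kernel, grid_dim, block_dim, x, out, len(x))
```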

Day 3

Make the exp kernel more efficient by exposing more parallelism! Its performance now matches cuBLAS.

Day 4

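Simplify the kernel code by using 2D partitioning. The pitfall is mapping the rows to the x dimension: in a row-major matrix, threads that are adjacent in x then touch elements a full row apart, so their memory accesses are not coalesced.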
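
The pitfall can be checked with plain index arithmetic. The sketch below assumes a row-major (C-contiguous) matrix, which is the default layout in PyTorch and CuPy, and compares the linear memory offsets touched by four threads with consecutive `threadIdx.x` under the two possible mappings. The function names are hypothetical.

```python
def offset_cols_on_x(tx, ty, n_cols):
    """x indexes columns, y indexes rows."""
    row, col = ty, tx
    return row * n_cols + col


def offset_rows_on_x(tx, ty, n_cols):
    """x indexes rows, y indexes columns (the pitfall)."""
    row, col = tx, ty
    return row * n_cols + col


n_cols = 1024  # illustrative matrix width

# Offsets touched by threads with threadIdx.x = 0..3 and threadIdx.y = 0.
good = [offset_cols_on_x(tx, 0, n_cols) for tx in range(4)]
bad = [offset_rows_on_x(tx, 0, n_cols) for tx in range(4)]

print(good)  # [0, 1, 2, 3]          -> contiguous addresses
print(bad)   # [0, 1024, 2048, 3072] -> stride of one full row
```

Threads in a warp are consecutive along x, so the first mapping lets a warp read one contiguous span of memory, while the second scatters the reads across rows.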

Day 5

A first taste of kernel fusion: create a fused exp-div kernel!
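
The idea behind fusion can be sketched on the CPU. The example below assumes, for illustration only, that "exp-div" means an element-wise exp followed by division by a scalar; the divisor and both function names are hypothetical. The unfused version corresponds to two kernel launches with an intermediate buffer written to and read back from memory, and the fused version does both operations in a single pass.

```python
import math


def exp_then_div(x, d):
    """Unfused: two passes over the data with an intermediate buffer."""
    tmp = [math.exp(v) for v in x]  # "kernel 1": writes the intermediate
    return [t / d for t in tmp]     # "kernel 2": reads it back


def fused_exp_div(x, d):
    """Fused: one pass, the intermediate stays in a local variable."""
    return [math.exp(v) / d for v in x]


x = [0.0, 0.5, 1.0, 2.0]
d = 4.0  # hypothetical scalar divisor

unfused = exp_then_div(x, d)
fused = fused_exp_div(x, d)
```

On a GPU the saving comes mainly from skipping the intermediate's round trip through global memory and from launching one kernel instead of two.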