This repo serves as an introduction to compiling numerical Python programs (A Python compiler). The source code is kept as simple as possible and each day is a small step forward based on the previous day. We will start from hand-coding simple functions, and gradually try to automatally generate/optimize them using a compiler.
Following the tutorial, you will learn:
- Basic CUDA programming
- How to accelerate your numerical code using parallelization, fusion and tiling
- How a compiler automatically generates and optimizes numerical code/loops
The end result is, given a function like the following:
def foo(a):
b = exp(a)
c = b.sum(axis=1)
d = b / c
return d
You will know how to make this code run much faster on CPU/GPU by just adding one line, like this
@compiler.jit
def foo(a):
b = exp(a)
c = b.sum(axis=1)
d = b / c
return d
and understand why it runs faster!
- pytorch
- cupy
You will also need a Nvidia GPU to run the code. For now we generate CUDA code and compute on the GPU by default.
Implement a JIT compiler using Python decorator!
Implement a simple matrix exp
function in CUDA!
Make the exp
kernel more efficient by using more parallelism! Now the performance already matches cuBLAS.
Simplify the kernel code by using 2D partitioning. The pitfall is partitioning the rows to x dim.
First taste of fusion by creating a fused exp-div kernel!
Introducing reduction by a simple implementation of softmax
.
Use parallel reduction and shared memory to make the softmax
implementation more efficient.
More parallel reduction.
Make the softmax
even more efficient by storing exp(a)
in fast memory and eliminate reloading from global memory.
Trying out template-based dynamic code generation.