Transformer in Triton

Triton makes it practical to write fused GPU kernels, which can give large speedups in machine learning workloads. This is an ongoing attempt to implement the entire transformer model in Triton for a substantial speed increase.

So far we have attention. Next we need the feedforward layer and rotary positional embeddings.
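
For the rotary positional embeddings that are still to be written, a plain Python reference can be useful for checking a future kernel's output. The sketch below is not code from this repository: the function name `rotary_embed`, the `base` default of 10000, and the choice to rotate adjacent pairs of features (rather than splitting the vector in halves) are all assumptions made for illustration.

```python
import math


def rotary_embed(x, position, base=10000.0):
    """Apply a rotary positional embedding to one feature vector.

    Hypothetical reference helper, not part of this repository.
    Assumes the "adjacent pairs" layout: features (x[2i], x[2i+1])
    are rotated together by an angle that depends on the position
    and on the pair index i.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")

    out = [0.0] * d
    for i in range(d // 2):
        # Rotation angle for this pair: position * base^(-2i/d)
        theta = position * base ** (-2.0 * i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        # Standard 2D rotation of the pair (a, b) by theta
        out[2 * i] = a * cos_t - b * sin_t
        out[2 * i + 1] = a * sin_t + b * cos_t
    return out
```

Because each pair is only rotated, the vector's length is unchanged, and the dot product between a rotated query and a rotated key depends only on the difference between their positions. Both properties make convenient sanity checks when comparing against a Triton version.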