Triton enables large speedups in machine learning through fused kernels. This is an ongoing attempt to implement the entire transformer model in Triton for a substantial speed increase.
So far we have attention; feedforward and rotary positional embeddings still need to be implemented.
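As a reference point for the rotary positional embeddings still to be written, here is a minimal plain-Python sketch of the rotation itself. It is not Triton code and not this repository's implementation: the function name `rotary_embed`, the interleaved pairing of dimensions, and the frequency base of 10000 are all assumptions chosen for illustration.

```python
import math


def rotary_embed(x, position, base=10000.0):
    """Rotate consecutive pairs of one vector by position-dependent angles.

    Hypothetical reference helper: pairs (x[0], x[1]), (x[2], x[3]), ...
    are each rotated by the angle position * base ** (-i / dim), where i
    is the index of the first element of the pair.
    """
    dim = len(x)
    if dim % 2 != 0:
        raise ValueError("the vector dimension must be even")

    out = []
    for i in range(0, dim, 2):
        # Lower pairs rotate quickly, higher pairs rotate slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x0, x1 = x[i], x[i + 1]
        # Standard 2D rotation of the pair (x0, x1) by theta.
        out.append(x0 * cos_t - x1 * sin_t)
        out.append(x0 * sin_t + x1 * cos_t)
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))
```

Because each pair is only rotated, the vector's norm is unchanged, and the dot product between a query rotated at position `m` and a key rotated at position `n` depends only on `m - n`. A fused kernel would presumably apply this rotation to queries and keys in the same pass as the attention computation, but that design is not settled here.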
pip install triton-transformer
MIT