PyTorch implementation of transformer algorithms described in "Formal Algorithms for Transformers" by Mary Phuong and Marcus Hutter: https://arxiv.org/abs/2207.09238
Algorithm 1: Token embedding (sketch below)
Algorithm 2: Positional embedding (sketch below)
Algorithm 3: Basic single-query attention (sketch below)
Algorithm 4: Ṽ ← Attention(X, Z | W_qkv, Mask)
Algorithm 5: Ṽ ← MHAttention(X, Z | W, Mask) (sketch below)
Algorithm 6: ê ← layer_norm(e | γ, β) (sketch below)
Algorithm 7: Unembedding
Algorithm 8: P ← EDTransformer(z, x | θ)
Algorithm 9: P ← ETransformer(x | θ)
Algorithm 10: P ← DTransformer(x | θ) (sketch below)
Algorithm 11: θ̂ ← EDTraining(z_{1:N_data}, x_{1:N_data}, θ)
Algorithm 12: θ̂ ← ETraining(x_{1:N_data}, θ)
Algorithm 13: θ̂ ← DTraining(x_{1:N_data}, θ) (sketch below)
Algorithm 14: y ← DInference(x, θ̂) (sketch below)
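Below are minimal, self-contained PyTorch sketches of several of these algorithms. They illustrate the paper's definitions rather than reproduce this repo's exact code; every class/function name and hyperparameter in them is an assumption chosen for readability. First, Algorithms 1 and 2 as learned lookup tables (the paper also discusses a fixed sinusoidal variant of the positional embedding, not shown here):

```python
import torch
import torch.nn as nn

class TokenAndPositionEmbedding(nn.Module):
    """Algorithms 1-2: learned token and positional embedding lookups."""

    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.w_e = nn.Embedding(vocab_size, d_model)  # token embedding matrix W_e
        self.w_p = nn.Embedding(max_len, d_model)     # positional embedding matrix W_p

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (L,) token ids -> (L, d_model) summed token + position embeddings
        positions = torch.arange(ids.shape[0], device=ids.device)
        return self.w_e(ids) + self.w_p(positions)
```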
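Algorithm 3 attends a single query vector over a sequence of key/value pairs. In this sketch the paper's learned query/key/value projections are folded out: the inputs are assumed to be the already-projected q, keys, and values.

```python
import math
import torch

def single_query_attention(q: torch.Tensor, keys: torch.Tensor,
                           values: torch.Tensor) -> torch.Tensor:
    """Algorithm 3: attend one query vector q (d,) over T context tokens,
    given their keys and values (T, d). Returns the combined value (d,)."""
    scores = keys @ q / math.sqrt(q.shape[-1])  # (T,) scaled dot products
    weights = torch.softmax(scores, dim=-1)     # attention distribution over context
    return weights @ values                     # convex combination of values
```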
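A sketch of Algorithm 5, multi-head attention. It assumes unbatched (L, d_model) inputs and a boolean mask (True where attention is allowed); both conventions are choices made here, not dictated by the paper.

```python
import math
import torch
import torch.nn as nn

class MHAttention(nn.Module):
    """Algorithm 5: multi-head attention of a primary sequence X over a
    context sequence Z (Z = X gives self-attention)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # output projection W_o

    def forward(self, x, z, mask=None):
        # x: (Lx, d_model), z: (Lz, d_model); mask: optional (Lx, Lz) bool,
        # True where position t of x may attend to position t' of z.
        lx, lz = x.shape[0], z.shape[0]
        q = self.w_q(x).view(lx, self.n_heads, self.d_head).transpose(0, 1)  # (H, Lx, dh)
        k = self.w_k(z).view(lz, self.n_heads, self.d_head).transpose(0, 1)  # (H, Lz, dh)
        v = self.w_v(z).view(lz, self.n_heads, self.d_head).transpose(0, 1)  # (H, Lz, dh)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)            # (H, Lx, Lz)
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))                # block masked pairs
        out = torch.softmax(scores, dim=-1) @ v                              # (H, Lx, dh)
        return self.w_o(out.transpose(0, 1).reshape(lx, -1))                 # (Lx, d_model)
```

With z = x and a lower-triangular mask this becomes the masked self-attention used in the decoder-only stack of Algorithm 10.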
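Algorithm 6 normalizes an activation vector and applies a learned scale and offset. The eps term below is the usual numerical-stability addition; the paper's idealized version omits it.

```python
import torch

def layer_norm(e: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Algorithm 6: normalize e to zero mean / unit variance over its last
    dimension, then apply the learned elementwise scale gamma and offset beta."""
    mean = e.mean(dim=-1, keepdim=True)
    var = e.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (e - mean) / torch.sqrt(var + eps) + beta
```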
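A sketch of Algorithm 10, the decoder-only transformer. For brevity it uses PyTorch's stock nn.TransformerEncoderLayer with an additive causal mask in place of the paper's explicit layer-by-layer pseudocode, and picks the pre-LN placement (norm_first=True); the hyperparameter defaults are placeholders.

```python
import torch
import torch.nn as nn

class DTransformer(nn.Module):
    """Algorithm 10: decoder-only transformer. Maps token ids (L,) to a
    next-token distribution (L, vocab_size) at every position."""

    def __init__(self, vocab_size: int, max_len: int,
                 d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)      # Algorithm 1
        self.pos = nn.Embedding(max_len, d_model)           # Algorithm 2
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.unembed = nn.Linear(d_model, vocab_size)       # Algorithm 7

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        L = ids.shape[0]
        e = self.embed(ids) + self.pos(torch.arange(L, device=ids.device))
        # additive causal mask: -inf above the diagonal forbids attending ahead
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=ids.device), diagonal=1)
        h = self.blocks(e.unsqueeze(0), mask=causal).squeeze(0)  # (L, d_model)
        return torch.softmax(self.unembed(h), dim=-1)            # (L, vocab_size)
```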
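Algorithms 11-13 are gradient-descent loops on the log loss. A sketch of Algorithm 13 (decoder-only training), assuming a model that returns per-position next-token distributions as in the DTransformer sketch above; Adam and the hyperparameters are common substitutions here, where the paper presents plain SGD.

```python
import torch

def d_training(model, sequences, n_epochs: int = 1, lr: float = 1e-3):
    """Algorithm 13: fit a decoder-only model on sequences x_{1:N_data} by
    maximizing the log-probability of each next token. Assumes model(x)
    returns (L, vocab) next-token distributions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # paper uses plain SGD
    for _ in range(n_epochs):
        for x in sequences:              # each x: (L,) LongTensor with L >= 2
            p = model(x)
            # p[t] predicts x[t+1]; gather those probabilities, take -log, average
            loss = -torch.log(p[:-1].gather(1, x[1:, None])).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```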
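Finally, a sketch of Algorithm 14, autoregressive inference with temperature sampling; the model interface is the same assumption as above.

```python
import torch

@torch.no_grad()
def d_inference(model, x: torch.Tensor, max_new_tokens: int,
                temperature: float = 1.0) -> torch.Tensor:
    """Algorithm 14: sample a continuation of the prompt x (L,) token ids;
    model(x) is assumed to return (L, vocab) next-token distributions."""
    for _ in range(max_new_tokens):
        p = model(x)[-1]                         # distribution after the last token
        logits = torch.log(p) / temperature      # temperature-scaled log-probs
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        x = torch.cat([x, next_id])              # append sampled token, repeat
    return x
```

For example, with the hypothetical pieces above: `model = DTransformer(vocab_size=1000, max_len=256)`, then `d_training(model, data)` followed by `d_inference(model, prompt, max_new_tokens=50)`.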