A simple implementation of a Transformer decoder (GPT) that aims to include the following (minimal sketches of each component appear after the list):
- SwiGLU: gated activation that replaces ReLU / GELU in the MLP blocks
- RoPE: encode token positions by applying position-dependent rotation matrices to the query and key vectors
- Partial rotary embedding: apply RoPE only to a subset of the dimensions of the q, k vectors
- Pre-LN transformer: layer normalization applied before each sub-layer inside the residual blocks, rather than after the residual addition
- RMSNorm: normalize the inputs to each layer by their root mean square (no mean centering), with a learned scale
- Multi-head latent attention: compress keys and values into a low-rank latent to shrink the KV cache and speed up inference
- Sparse mixture of experts: multiple expert MLPs plus a router that dynamically routes each token to a small subset of experts
Inspired by Andrej Karpathy's NanoGPT
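
A minimal sketch of the SwiGLU MLP block in PyTorch. The module and parameter names (`SwiGLU`, `d_model`, `d_hidden`) are illustrative, not the repo's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Illustrative MLP that uses the SwiGLU gated activation instead of ReLU/GELU."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x W_gate) * x W_up) W_down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```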
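
A sketch of RoPE with partial rotary embedding, assuming queries/keys shaped `(batch, heads, seq_len, head_dim)`; only the first `rotary_dim` channels are rotated and the rest pass through unchanged. Function names here are hypothetical:

```python
import torch

def rope_cache(seq_len: int, rotary_dim: int, base: float = 10000.0):
    """Precompute the cos/sin tables used to rotate the first `rotary_dim` channels."""
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)          # (seq_len, rotary_dim / 2)
    return angles.cos(), angles.sin()

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rotary_dim: int) -> torch.Tensor:
    """Rotate only the first `rotary_dim` channels of q or k; pass the rest through.
    x is assumed to be (batch, heads, seq_len, head_dim)."""
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]        # even / odd channel pairs
    # each (x1, x2) pair is rotated by a position- and frequency-dependent angle
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated.flatten(-2), x_pass), dim=-1)

# usage: the same tables rotate both q and k before the attention dot product; v is untouched
cos, sin = rope_cache(seq_len=128, rotary_dim=32)
q = apply_partial_rope(torch.randn(2, 8, 128, 64), cos, sin, rotary_dim=32)
```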
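
A sketch of a Pre-LN residual block; `attn` and `mlp` stand in for any sub-layers mapping `(B, T, d_model)` to the same shape, and `nn.LayerNorm` could be swapped for the RMSNorm sketched next:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN block: normalize *before* each sub-layer, then add the residual."""
    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # placed inside the residual branch, not after it
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))  # attention sub-layer sees normalized input
        x = x + self.mlp(self.norm2(x))   # MLP sub-layer sees normalized input
        return x
```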
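
A minimal RMSNorm sketch: learned scale, no bias, no mean subtraction; the `eps` value is an assumption:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Illustrative RMSNorm: scale activations by the inverse of their root mean square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms_inv * self.weight
```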
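
A heavily simplified sketch of multi-head latent attention: keys and values are reconstructed from a shared low-rank latent, so only the small latent would need to be cached at inference time. It omits details such as the decoupled RoPE handling in published MLA variants, and all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    """Illustrative, simplified MLA: cache a low-rank latent instead of full keys/values."""
    def __init__(self, d_model: int, n_heads: int, d_latent: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent (cached)
        self.w_k_up = nn.Linear(d_latent, d_model, bias=False)     # expand latent to keys
        self.w_v_up = nn.Linear(d_latent, d_model, bias=False)     # expand latent to values
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q = self.w_q(x)
        latent = self.w_kv_down(x)       # (B, T, d_latent): the only per-token state to cache
        k = self.w_k_up(latent)
        v = self.w_v_up(latent)
        # split into heads: (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(y.transpose(1, 2).reshape(B, T, C))
```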
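
A sketch of a sparse mixture-of-experts layer with top-k routing; the expert count, hidden size, and the renormalization of router weights over the selected experts are illustrative choices:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Illustrative sparse MoE: a router picks the top-k experts per token and
    combines their outputs weighted by the renormalized router probabilities."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        flat = x.view(-1, C)                                   # (B*T, C)
        logits = self.router(flat)                             # (B*T, n_experts)
        weights, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            tokens, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if tokens.numel() == 0:
                continue                                       # expert receives no tokens
            out[tokens] += weights[tokens, slot].unsqueeze(-1) * expert(flat[tokens])
        return out.view(B, T, C)
```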