A simple implementation of a Transformer decoder (GPT) that aims to include the following (minimal sketches of each component appear after the list):
- SwiGLU: gated activation that replaces ReLU / GELU in the MLP blocks
- RoPE: encode token positions by applying position-dependent rotation matrices to the query and key vectors
- Partial rotary embedding: apply RoPE only to a subset of the dimensions of the q, k vectors
- Pre-LN transformer: layer normalization applied before each sub-layer inside the residual blocks, rather than after the residual addition
- RMSNorm: normalize the inputs to each layer by their root mean square (no mean centering), with a learned scale
- Multi-head latent attention: compress keys and values into a low-rank latent to shrink the KV cache and speed up inference
- Sparse mixture of experts: multiple expert MLPs plus a router that dynamically routes each token to a small subset of experts
Inspired by Andrej Karpathy's NanoGPT
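
A minimal sketch of the SwiGLU MLP block in PyTorch. The module and parameter names (`SwiGLU`, `d_model`, `d_hidden`) are illustrative, not the repo's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Illustrative MLP that uses the SwiGLU gated activation instead of ReLU/GELU."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x W_gate) * x W_up) W_down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```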
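
A sketch of RoPE with partial rotary embedding, assuming queries/keys shaped `(batch, heads, seq_len, head_dim)`; only the first `rotary_dim` channels are rotated and the rest pass through unchanged. Function names here are hypothetical:

```python
import torch

def rope_cache(seq_len: int, rotary_dim: int, base: float = 10000.0):
    """Precompute the cos/sin tables used to rotate the first `rotary_dim` channels."""
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)          # (seq_len, rotary_dim / 2)
    return angles.cos(), angles.sin()

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       rotary_dim: int) -> torch.Tensor:
    """Rotate only the first `rotary_dim` channels of q or k; pass the rest through.
    x is assumed to be (batch, heads, seq_len, head_dim)."""
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]        # even / odd channel pairs
    # each (x1, x2) pair is rotated by a position- and frequency-dependent angle
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated.flatten(-2), x_pass), dim=-1)

# usage: the same tables rotate both q and k before the attention dot product; v is untouched
cos, sin = rope_cache(seq_len=128, rotary_dim=32)
q = apply_partial_rope(torch.randn(2, 8, 128, 64), cos, sin, rotary_dim=32)
```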
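
A sketch of a Pre-LN residual block; `attn` and `mlp` stand in for any sub-layers mapping `(B, T, d_model)` to the same shape, and `nn.LayerNorm` could be swapped for the RMSNorm sketched next:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN block: normalize *before* each sub-layer, then add the residual."""
    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)  # placed inside the residual branch, not after it
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))  # attention sub-layer sees normalized input
        x = x + self.mlp(self.norm2(x))   # MLP sub-layer sees normalized input
        return x
```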
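
A minimal RMSNorm sketch: learned scale, no bias, no mean subtraction; the `eps` value is an assumption:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Illustrative RMSNorm: scale activations by the inverse of their root mean square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel scale, no bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms_inv * self.weight
```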
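
A heavily simplified sketch of multi-head latent attention: keys and values are reconstructed from a shared low-rank latent, so only the small latent would need to be cached at inference time. It omits details such as the decoupled RoPE handling in published MLA variants, and all names and dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    """Illustrative, simplified MLA: cache a low-rank latent instead of full keys/values."""
    def __init__(self, d_model: int, n_heads: int, d_latent: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_kv_down = nn.Linear(d_model, d_latent, bias=False)  # compress to latent (cached)
        self.w_k_up = nn.Linear(d_latent, d_model, bias=False)     # expand latent to keys
        self.w_v_up = nn.Linear(d_latent, d_model, bias=False)     # expand latent to values
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q = self.w_q(x)
        latent = self.w_kv_down(x)       # (B, T, d_latent): the only per-token state to cache
        k = self.w_k_up(latent)
        v = self.w_v_up(latent)
        # split into heads: (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(y.transpose(1, 2).reshape(B, T, C))
```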
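
A sketch of a sparse mixture-of-experts layer with top-k routing; the expert count, hidden size, and the renormalization of router weights over the selected experts are illustrative choices:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Illustrative sparse MoE: a router picks the top-k experts per token and
    combines their outputs weighted by the renormalized router probabilities."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        flat = x.view(-1, C)                                   # (B*T, C)
        logits = self.router(flat)                             # (B*T, n_experts)
        weights, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            tokens, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if tokens.numel() == 0:
                continue                                       # expert receives no tokens
            out[tokens] += weights[tokens, slot].unsqueeze(-1) * expert(flat[tokens])
        return out.view(B, T, C)
```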