An end-to-end walkthrough of the transformer architecture and its heuristics, for my own (and potentially others') learning purposes. Updated continuously.
- Self-Attention (a minimal sketch follows this list)
- Block Architecture
- Word Embedding
- Positional Embedding
- Masking
- Transformer Architecture
- GPT Architecture
- Training
  - GPT training
  - iGPT training
- Newer Attention Mechanisms with better time complexity (Linformer, Reformer, etc.)
- Computer Vision (Vision Transformer)
- PyTorch Lightning refactor
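As referenced in the list above, below is a minimal PyTorch sketch of scaled dot-product self-attention with causal masking. The function names, shapes, and weight handling are illustrative assumptions for this README, not code lifted from the notebooks:

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(x, w_q, w_k, w_v, causal=True):
    """Scaled dot-product self-attention over a (batch, seq_len, d_model) input.

    Minimal sketch: real implementations add multiple heads, dropout, and
    learned nn.Linear projections instead of raw weight matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project to queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (batch, seq_len, seq_len)
    if causal:
        # Causal mask: position i may only attend to positions j <= i, so
        # upper-triangular entries are set to -inf before the softmax.
        mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                      # attention-weighted sum of values
```

Because attention itself is permutation-invariant, a position signal is summed into the token embeddings beforehand. This is the fixed sinusoidal variant from "Attention Is All You Need" (GPT instead learns positions with an `nn.Embedding`):

```python
def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal position signal of shape (seq_len, d_model); d_model must be even."""
    position = torch.arange(seq_len).unsqueeze(1)             # (seq_len, 1)
    # Geometric progression of frequencies, one per sin/cos pair of dimensions.
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)              # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)              # odd dimensions
    return pe

# Example with made-up sizes: batch of 2, sequence length 10, model width 64.
x = torch.randn(2, 10, 64) + sinusoidal_positional_encoding(10, 64)
w_q, w_k, w_v = (torch.randn(64, 64) / 8.0 for _ in range(3))
out = masked_self_attention(x, w_q, w_k, w_v)                 # (2, 10, 64)
```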
The author wishes to acknowledge the following sources (in-text citations can also be found in the notebooks).
- Attention Is All You Need
- Language Models are Unsupervised Multitask Learners
- Layer Normalization
- Language Models are Few-Shot Learners
_minGPT is adapted from [karpathy/minGPT](https://github.com/karpathy/minGPT/blob/master)._
- http://peterbloem.nl/blog/transformers
- http://jalammar.github.io/illustrated-transformer/
- https://jalammar.github.io/illustrated-gpt2/
- http://juditacs.github.io/2018/12/27/masked-attention.html
- https://nlp.seas.harvard.edu/2018/04/03/attention.html
- https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
- https://stackoverflow.com/questions/50747947/embedding-in-pytorch
- https://www.reddit.com/r/MachineLearning/comments/cttefo/d_positional_encoding_in_transformer/