An end-to-end walkthrough of the transformer architecture and its heuristics, for my own (and potentially others') learning purposes. Updated continuously.
- Self-Attention (a minimal sketch follows this list)
- Block Architecture
- Word Embedding
- Positional Embedding
- Masking
- Transformer Architecture
- GPT Architecture
- Training
  - GPT training
  - iGPT training
- Newer Attention Mechanisms with better time complexity (Linformer, Reformer, etc.)
- Computer Vision (Vision Transformer)
- PyTorch Lightning refactor
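As referenced in the list above, below is a minimal PyTorch sketch of scaled dot-product self-attention with causal masking. The function names, shapes, and weight handling are illustrative assumptions for this README, not code lifted from the notebooks:

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(x, w_q, w_k, w_v, causal=True):
    """Scaled dot-product self-attention over a (batch, seq_len, d_model) input.

    Minimal sketch: real implementations add multiple heads, dropout, and
    learned nn.Linear projections instead of raw weight matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project to queries, keys, values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (batch, seq_len, seq_len)
    if causal:
        # Causal mask: position i may only attend to positions j <= i, so
        # upper-triangular entries are set to -inf before the softmax.
        mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                      # attention-weighted sum of values
```

Because attention itself is permutation-invariant, a position signal is summed into the token embeddings beforehand. This is the fixed sinusoidal variant from "Attention Is All You Need" (GPT instead learns positions with an `nn.Embedding`):

```python
def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal position signal of shape (seq_len, d_model); d_model must be even."""
    position = torch.arange(seq_len).unsqueeze(1)             # (seq_len, 1)
    # Geometric progression of frequencies, one per sin/cos pair of dimensions.
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)              # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)              # odd dimensions
    return pe

# Example with made-up sizes: batch of 2, sequence length 10, model width 64.
x = torch.randn(2, 10, 64) + sinusoidal_positional_encoding(10, 64)
w_q, w_k, w_v = (torch.randn(64, 64) / 8.0 for _ in range(3))
out = masked_self_attention(x, w_q, w_k, w_v)                 # (2, 10, 64)
```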
The author wishes to acknowledge the following sources (in-text citations can also be found in the notebooks).
- Attention Is All You Need
- Language Models are Unsupervised Multitask Learners
- Layer Normalization
- Language Models are Few-Shot Learners
_minGPT is adapted from [karpathy/minGPT](https://github.com/karpathy/minGPT/blob/master)._
- http://peterbloem.nl/blog/transformers
- http://jalammar.github.io/illustrated-transformer/
- https://jalammar.github.io/illustrated-gpt2/
- http://juditacs.github.io/2018/12/27/masked-attention.html
- https://nlp.seas.harvard.edu/2018/04/03/attention.html
- https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
- https://stackoverflow.com/questions/50747947/embedding-in-pytorch
- https://www.reddit.com/r/MachineLearning/comments/cttefo/d_positional_encoding_in_transformer/