First-Principles-Transformers

An end-to-end walkthrough of transformer architecture and heuristics, for my own (and potentially others') learning purposes. Updated continuously.


Transformers implemented and explained from scratch, culminating in (smaller-scale) GPT-3 and iGPT.

Current Version:

  • Self-Attention (see the sketch after this list)
  • Block Architecture
  • Word Embedding
  • Positional Embedding
  • Masking
  • Transformer Architecture
  • GPT Architecture
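
Since self-attention and masking are the core mechanisms the notebooks build up, here is a minimal sketch of masked (causal) scaled dot-product self-attention in plain PyTorch. The function name, shapes, and single-head setup are illustrative assumptions, not the notebooks' exact code.

```python
# Minimal single-head causal self-attention sketch (illustrative, not the repo's exact implementation).
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project inputs to queries, keys, values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)         # (batch, seq_len, seq_len) similarity scores
    seq_len = x.size(1)
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device))
    scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future positions from each query
    weights = F.softmax(scores, dim=-1)                       # weights sum to 1 over visible positions
    return weights @ v                                        # weighted sum of values

# Toy usage: a batch of 2 sequences of length 5 with model width 8
x = torch.randn(2, 5, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)          # torch.Size([2, 5, 8])
```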

TODO:

  • Training
  • GPT training
  • iGPT training

Optional Exploration:

  • Newer Attention Mechanisms with better time complexity (Linformer, Reformer, etc.)
  • Computer Vision (Vision Transformer)
  • PyTorch Lightning refactor

Biblio:

The author wishes to acknowledge the following sources (in-text citations can also be found in the notebooks).

Original papers:

Attention Is All You Need
Language Models are Unsupervised Multitask Learners
Layer Normalization
Language Models are Few-Shot Learners

Repo:

_minGPT is adapted from karpathy/minGPT: https://github.com/karpathy/minGPT/blob/master

The following material has been inspirational and helpful for understanding transformers better:

http://peterbloem.nl/blog/transformers
http://jalammar.github.io/illustrated-transformer/
https://jalammar.github.io/illustrated-gpt2/
http://juditacs.github.io/2018/12/27/masked-attention.html

The following material has been interesting to explore:

https://nlp.seas.harvard.edu/2018/04/03/attention.html
https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
https://stackoverflow.com/questions/50747947/embedding-in-pytorch
https://www.reddit.com/r/MachineLearning/comments/cttefo/d_positional_encoding_in_transformer/