Attention is all you need.
- einops starter
- attentions
- multi-head causal attention
- multi-head cross attention
- multi-head grouped query attention (torch + einops); a minimal sketch follows below
- positional embeddings
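A minimal sketch of the grouped-query attention idea in torch + einops, not the repo's exact module; the sizes and projection layers here are illustrative. Each key/value head is shared by a group of query heads, so the K/V projections are smaller than the query projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange, repeat

# hypothetical sizes, not tied to any config in this repo
dim, n_heads, n_kv_heads, head_dim = 256, 8, 2, 32
wq = nn.Linear(dim, n_heads * head_dim)
wk = nn.Linear(dim, n_kv_heads * head_dim)  # fewer key/value heads than query heads
wv = nn.Linear(dim, n_kv_heads * head_dim)

x = torch.randn(1, 10, dim)  # (batch, seq, dim)
q = rearrange(wq(x), "b s (h d) -> b h s d", h=n_heads)
k = rearrange(wk(x), "b s (h d) -> b h s d", h=n_kv_heads)
v = rearrange(wv(x), "b s (h d) -> b h s d", h=n_kv_heads)

# each key/value head is shared by n_heads // n_kv_heads query heads
k = repeat(k, "b h s d -> b (h g) s d", g=n_heads // n_kv_heads)
v = repeat(v, "b h s d -> b (h g) s d", g=n_heads // n_kv_heads)

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = F.softmax(scores, dim=-1) @ v
out = rearrange(out, "b h s d -> b s (h d)")  # (1, 10, n_heads * head_dim)
```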
- Low-Rank Adaptation (LoRA)
- implementing LoRA based on this wonderful tutorial by Sebastian Raschka (a minimal sketch follows below)
- finetuning the LoRA-adapted deberta-v3-base on the IMDb dataset
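A minimal sketch of a LoRA-adapted linear layer in the spirit of Raschka's tutorial; the class name and the r/alpha defaults are illustrative, not the repo's code. The pretrained weight stays frozen and only the low-rank A/B matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update: Wx + (alpha/r) * B A x."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights (and bias)
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))        # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.linear(x) + self.scale * (x @ self.A.T @ self.B.T)

# usage: wrap an existing projection, e.g. a query projection inside an attention block
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
y = layer(torch.randn(2, 768))  # (2, 768)
```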
- simple Vision Transformer
- for process, check building_ViT.ipynb
- model implementation
- used mean pooling instead of the [class] token (see the sketch after this section)
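A minimal sketch of that pooling choice; the sizes and the head are placeholders, not the repo's config. Instead of prepending a [class] token and classifying from it, the encoder's patch tokens are averaged and fed to the head.

```python
import torch
import torch.nn as nn

dim, num_classes = 384, 10          # hypothetical sizes
head = nn.Linear(dim, num_classes)

tokens = torch.randn(8, 196, dim)   # (batch, num_patches, dim): encoder output, no [class] token
pooled = tokens.mean(dim=1)         # mean pooling over the patch tokens
logits = head(pooled)               # (8, num_classes)
```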
- GPT2
- for process, check buildingGPT2.ipynb
- model implementation
- built in such a way that it supports loading pretrained OpenAI/HuggingFace weights: gpt2-load-via-hf.ipynb (see the weight-loading sketch below)
- for my own custom-trained causal LM, check out shakespeareGPT, although it is a bit more like GPT-1.
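A rough sketch of the main gotcha when loading the HuggingFace GPT-2 checkpoint into a plain torch implementation: HF stores the attention and MLP projections as Conv1D modules, so those weight matrices are transposed relative to nn.Linear. The actual parameter-name mapping lives in gpt2-load-via-hf.ipynb, so the copy step is only indicated here.

```python
import torch
from transformers import GPT2LMHeadModel

hf_state = GPT2LMHeadModel.from_pretrained("gpt2").state_dict()

# these projections are Conv1D in HF, so their weights need a transpose
needs_transpose = ("attn.c_attn.weight", "attn.c_proj.weight",
                   "mlp.c_fc.weight", "mlp.c_proj.weight")

for name, tensor in hf_state.items():
    if name.endswith(needs_transpose):
        tensor = tensor.t()
    # ...copy `tensor` into the matching parameter of the custom model here;
    # the exact name mapping depends on how the custom modules are named
```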
- OpenAI CLIP
- implemented the ViT-B/32 variant
- for process, check building_clip.ipynb
- inference req: install clip for tokenization and preprocessing: pip install git+https://github.com/openai/CLIP.git
- model implementation
- zero-shot inference code (a minimal usage sketch follows this section)
- built in such a way that it supports loading pretrained openAI weights and IT WORKS!!!
- My lighter implementation of this, using existing image and language models and trained on the Flickr8k dataset, is available here: liteCLIP
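A minimal zero-shot sketch using the clip package installed above for tokenization and preprocessing; the image path and prompts are placeholders, and the repo's own ViT-B/32 model (with the OpenAI weights loaded) can stand in for the model returned by clip.load.

```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)            # placeholder image path
texts = clip.tokenize(["a photo of a cat", "a photo of a dog"])   # candidate captions

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)   # cosine similarity needs unit vectors
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability that the image matches each caption
```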
- Encoder Decoder Transformer
- for process, check building_encoder-decoder.ipynb
- model implementation
- src_mask for the encoder is optional but nice to have: it masks out the pad tokens so no attention is paid to them (see the sketch after this section).
- used learned positional embeddings instead of the sin/cos encodings from the original paper.
- I trained a model for multilingual machine translation.
- Translates English to Hindi and Telugu.
- change: a single shared embedding layer for the encoder & decoder since I used a single tokenizer.
- for the code and results check: shreydan/multilingual-translation
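A minimal sketch of the src_mask idea; the pad id and shapes are illustrative. A padding mask is built from the source token ids, then attention scores at pad positions are pushed to -inf before the softmax so those tokens receive no attention.

```python
import torch

PAD_ID = 0  # hypothetical pad token id

def make_src_mask(src_ids):
    # (batch, 1, 1, src_len): True at real tokens, False at pad positions
    return (src_ids != PAD_ID)[:, None, None, :]

src_ids = torch.tensor([[5, 8, 2, PAD_ID, PAD_ID]])
src_mask = make_src_mask(src_ids)

scores = torch.randn(1, 4, 5, 5)  # (batch, heads, query_len, key_len)
scores = scores.masked_fill(~src_mask, float("-inf"))  # padded keys get zero attention weight
attn = scores.softmax(dim=-1)
```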
- BERT - MLM
- for process of masked language modeling, check masked-language-modeling.ipynb
- model implementation
- simplification: no [CLS] & [SEP] tokens during pre-training, since I only built the model for masked language modeling and not for next sentence prediction (a minimal masking sketch follows this section).
- I trained an entire model on the Wikipedia dataset; more info in the shreydan/masked-language-modeling repo.
- once pretrained, the MLM head can be replaced with any other downstream task head.
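A minimal sketch of the standard BERT masking recipe used for MLM pre-training: mask 15% of tokens; of those, 80% become [MASK], 10% a random token, and 10% stay unchanged. The special-token id and vocab size are placeholders, not the repo's tokenizer values.

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30000  # placeholder tokenizer values

def mask_tokens(input_ids, mlm_prob=0.15):
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # compute the loss only on masked positions

    # 80% of the masked positions -> [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[to_mask] = MASK_ID
    # half of the rest (10% overall) -> a random token; the remaining 10% stay unchanged
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(VOCAB_SIZE, input_ids.shape)[to_random]
    return input_ids, labels

ids = torch.randint(5, VOCAB_SIZE, (2, 16))
masked_ids, labels = mask_tokens(ids)
```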
- ViT MAE
- Paper: Masked autoencoders are scalable vision learners
- model implementation
- for process, check: building-vitmae.ipynb
- Quite reliant on the original code released by the authors (a random-masking sketch follows this section).
- Only simplification: no [CLS] token, so mean pooling is used instead.
- The model can be trained for 2 purposes:
- For pretraining: the decoder can be thrown away and the encoder can be used for downstream tasks
- For visualization: can be used to reconstruct masked images.
- I trained a smaller model for reconstruction visualization: ViTMAE on Animals Dataset
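A minimal sketch of MAE-style random masking, following the shuffle trick from the original code: score every patch with random noise, keep the 25% with the lowest scores, and record a binary mask so the loss is computed only on the removed patches. Sizes are illustrative.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)            # one random score per patch
    ids_shuffle = noise.argsort(dim=1)  # lowest-noise patches are kept
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(patches, 1, ids_keep[:, :, None].expand(-1, -1, d))
    mask = torch.ones(b, n)
    mask.scatter_(1, ids_keep, 0)       # 0 = kept, 1 = masked (used by the reconstruction loss)
    return kept, mask

patches = torch.randn(4, 196, 768)      # (batch, num_patches, dim), hypothetical sizes
kept, mask = random_masking(patches)    # kept: (4, 49, 768)
```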
requirements:
- einops
- torch
- torchvision
- numpy
- matplotlib
- pandas
God is our refuge and strength, a very present help in trouble.
Psalm 46:1