This is a reading list of papers, videos, and repos I've personally found useful while ramping up on ML Systems, and that I wish more people would sit down and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below and enjoy!
- Attention Is All You Need: Start here; still one of the best intros
- Online normalizer calculation for softmax: A must-read before Flash Attention; it will help you get the main "trick" (sketched after this list)
- Self Attention Does Not Need O(n^2) Memory: A short precursor to Flash Attention showing that the quadratic memory cost can be avoided by processing attention in chunks
- Flash Attention 2: The diagrams here do a better job of explaining Flash Attention 1 as well
- Llama 2 paper: Skim it for the model details
- gpt-fast: A great repo to come back to for minimal yet performant code
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: There are tons of papers on long context lengths, but I found this to be among the clearest
- Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Wonderful survey, start here
- Efficiently Scaling Transformer Inference: Introduced many ideas, most notably KV caches (sketched after this list)
- Making Deep Learning go Brrr from First Principles: One of the best intros to fusions and overhead
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: The ZeRO algorithm behind FSDP and DeepSpeed, which intelligently reduces the memory usage of data parallelism
- Megatron-LM: For an introduction to Tensor Parallelism
- Fast Inference from Transformers via Speculative Decoding: This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding (the decode loop is sketched after this list)
- Group Query Attention: KV caches can be chunky; this is how you fix it (sketched after this list)
- Orca: A Distributed Serving System for Transformer-Based Generative Models: Introduced continuous batching (a great pre-read for the PagedAttention paper)
- Efficient Memory Management for Large Language Model Serving with PagedAttention: The most crucial optimization for high-throughput batch inference (the block-table idea is sketched after this list)
- A White Paper on Neural Network Quantization: Start here; this will give you the foundation to quickly skim all the other quantization papers (the basic quantize/dequantize round trip is sketched after this list)
- LLM.int8: All of Dettmers' papers are great, but this is a natural intro
- FP8 formats for deep learning: For a firsthand look at how new number formats come about
- Smoothquant: Balancing rounding errors between weights and activations (sketched after this list)
- RoFormer: Enhanced Transformer with Rotary Position Embedding: The paper that introduced rotary positional embeddings (sketched after this list)
- YaRN: Efficient Context Window Extension of Large Language Models: Extend base model context lengths with finetuning
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Scale to infinite context lengths as long as you can stack more GPUs
- Venom: Vectorized N:M Format for sparse tensor cores
- MegaBlocks: Efficient sparse training with mixture of experts
- ReLU Strikes Back: Activation sparsity in LLMs
- Sparse Llama
- Simple pruning for LLMs
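
A few of the ideas above click faster with a few lines of code, so here are some minimal sketches. They are illustrative, not the papers' implementations, and any names or shapes in them are my own choices. First, the online softmax "trick" from the online normalizer paper: the running max and the normalizer are maintained together in a single pass over the scores, which is the building block Flash Attention tiles around.

```python
import torch

def online_softmax(scores: torch.Tensor) -> torch.Tensor:
    """Compute the softmax normalizer in one pass by keeping a running max `m`
    and a running sum `d`, rescaling `d` whenever the max changes."""
    m = torch.tensor(float("-inf"))  # running max
    d = torch.tensor(0.0)            # running sum of exp(x - m)
    for x in scores:
        m_new = torch.maximum(m, x)
        # rescale the old sum to the new max, then add the new term
        d = d * torch.exp(m - m_new) + torch.exp(x - m_new)
        m = m_new
    return torch.exp(scores - m) / d

x = torch.randn(8)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=0), atol=1e-6)
```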
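Next, the KV cache idea from the Efficiently Scaling Transformer Inference bullet, reduced to a single attention head: keys and values for past tokens are stored and reused, so each decode step only does work proportional to one new token.

```python
import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One autoregressive decode step with a KV cache (single head).
    q_new, k_new, v_new: (1, d) projections for the newest token.
    k_cache, v_cache: (t, d) keys/values for all previous tokens."""
    k_cache = torch.cat([k_cache, k_new], dim=0)   # (t+1, d)
    v_cache = torch.cat([v_cache, v_new], dim=0)   # (t+1, d)
    # attention is computed for the new token only: (1, t+1) scores over the cache
    scores = (q_new @ k_cache.T) / k_cache.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ v_cache  # (1, d)
    return out, k_cache, v_cache

d = 16
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(4):  # decode four tokens, growing the cache each step
    q, k, v = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
```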
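The speculative decoding loop, greatly simplified: a cheap draft model proposes several tokens sequentially, and the large target model verifies them all in one prefill-style forward pass. The greedy accept rule below is a simplification of the paper's rejection-sampling scheme, and the models here are just stand-in callables.

```python
import torch

def speculative_step(draft_model, target_model, prefix, k=4):
    """Propose k tokens with the draft model, then verify them with a single
    forward pass of the target model (greedy variant).
    Both models are callables: tokens (1, t) -> logits (1, t, vocab)."""
    # 1) draft k tokens autoregressively (cheap but sequential)
    draft = prefix
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)
    # 2) verify all k proposals with one target forward pass (parallel, prefill-like)
    target_pred = target_model(draft)[:, :-1].argmax(dim=-1)  # prediction for each next token
    accepted = prefix
    for i in range(prefix.shape[1], draft.shape[1]):
        proposed, verified = draft[:, i], target_pred[:, i - 1]
        if torch.equal(proposed, verified):
            accepted = torch.cat([accepted, proposed[:, None]], dim=-1)
        else:
            # first mismatch: keep the target model's token and stop
            accepted = torch.cat([accepted, verified[:, None]], dim=-1)
            break
    return accepted

vocab = 100
dummy = lambda toks: torch.randn(1, toks.shape[1], vocab)  # placeholder models
out = speculative_step(dummy, dummy, torch.randint(0, vocab, (1, 5)))
```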
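Grouped-query attention in a nutshell: many query heads share a smaller number of KV heads, so the KV cache shrinks by the group factor. Shapes below are illustrative and the causal mask is omitted.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d),
    with n_q_heads a multiple of n_kv_heads. Each group of query heads
    attends to the same KV head, so only n_kv_heads of K/V are cached."""
    b, n_q, s, d = q.shape
    group = n_q // k.shape[1]
    # replicate each KV head across its group of query heads
    k = k.repeat_interleave(group, dim=1)  # (b, n_q, s, d)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 32, 64)   # 8 query heads
k = torch.randn(1, 2, 32, 64)   # only 2 KV heads are cached
v = torch.randn(1, 2, 32, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 32, 64)
```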
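The block-table idea behind PagedAttention, as a toy data structure (not vLLM's code): the KV cache lives in fixed-size physical blocks from a shared pool, and each sequence maps logical token positions to blocks on demand instead of reserving memory for its maximum length.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool shared by all sequences

    def allocate(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # only grab a new physical block when the current one is full
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # where token `pos`'s K/V entries actually live
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table), seq.physical_slot(39))  # 3 blocks, (block_id, 7)
```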
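The core operation the quantization white paper builds on, shown as a symmetric per-tensor absmax int8 quantize/dequantize round trip. This is the generic scheme, not any one paper's recipe.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor absmax quantization to int8."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())  # rounding error is bounded by roughly scale / 2
```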
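The SmoothQuant rebalancing mentioned above: a per-channel scale migrates quantization difficulty from activation outliers into the weights while leaving the matmul mathematically unchanged, since X W = (X diag(s)^-1)(diag(s) W). The alpha knob follows the paper's description; the code itself is illustrative.

```python
import torch

def smooth_scales(x_absmax, w_absmax, alpha=0.5):
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)

# X: (tokens, in_features) activations, W: (in_features, out_features) weights
X = torch.randn(128, 64)
X[:, 3] *= 50.0                      # a typical activation outlier channel
W = torch.randn(64, 32)

s = smooth_scales(X.abs().amax(dim=0), W.abs().amax(dim=1))
X_smooth = X / s                     # activations become easier to quantize
W_smooth = W * s[:, None]            # weights absorb the difficulty

# the product is unchanged, so smoothing is a free preprocessing step
assert torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
```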
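Finally, a compact sketch of rotary position embeddings from the RoFormer bullet: pairs of feature dimensions are rotated by a position-dependent angle, so relative position shows up as a phase difference in the query-key dot product. The even/odd pairing below is one common convention, not necessarily the layout a given codebase uses.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (seq, dim), dim even. Each (even, odd)
    feature pair is rotated by angle pos * theta_i, where the theta_i
    form a geometric series of frequencies."""
    seq, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    pos = torch.arange(seq, dtype=torch.float32)[:, None]                  # (seq, 1)
    angles = pos * theta                                                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin   # 2D rotation applied per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

q = rotary_embed(torch.randn(16, 64))
k = rotary_embed(torch.randn(16, 64))
# the position-dependent part of q[i] . k[j] now depends only on i - j
```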