This is a reading list of papers, videos, and repos I've personally found useful while ramping up on ML Systems, and that I wish more people would sit down and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below and enjoy!
- Attention Is All You Need: Start here; still one of the best intros
- Online normalizer calculation for softmax: A must-read before Flash Attention; it will help you get the main "trick" (sketched after this list)
- Self Attention Does Not Need O(n^2) Memory: A short precursor to Flash Attention showing that the quadratic memory cost can be avoided by processing attention in chunks
- Flash Attention 2: The diagrams here do a better job of explaining Flash Attention 1 as well
- Llama 2 paper: Skim it for the model details
- gpt-fast: A great repo to come back to for minimal yet performant code
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: There are tons of papers on long context lengths, but I found this to be among the clearest
- Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Wonderful survey, start here
- Efficiently Scaling Transformer Inference: Introduced many ideas, most notably KV caches (sketched after this list)
- Making Deep Learning go Brrr from First Principles: One of the best intros to fusions and overhead
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models: The ZeRO algorithm behind FSDP and DeepSpeed, which intelligently reduces the memory usage of data parallelism
- Megatron-LM: For an introduction to Tensor Parallelism
- Fast Inference from Transformers via Speculative Decoding: This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding (the decode loop is sketched after this list)
- Group Query Attention: KV caches can be chunky; this is how you fix it (sketched after this list)
- Orca: A Distributed Serving System for Transformer-Based Generative Models: Introduced continuous batching (a great pre-read for the PagedAttention paper)
- Efficient Memory Management for Large Language Model Serving with PagedAttention: The most crucial optimization for high-throughput batch inference (the block-table idea is sketched after this list)
- A White Paper on Neural Network Quantization: Start here; this will give you the foundation to quickly skim all the other quantization papers (the basic quantize/dequantize round trip is sketched after this list)
- LLM.int8: All of Dettmers' papers are great, but this is a natural intro
- FP8 formats for deep learning: For a firsthand look at how new number formats come about
- Smoothquant: Balancing rounding errors between weights and activations (sketched after this list)
- RoFormer: Enhanced Transformer with Rotary Position Embedding: The paper that introduced rotary positional embeddings (sketched after this list)
- YaRN: Efficient Context Window Extension of Large Language Models: Extend base model context lengths with finetuning
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Scale to infinite context lengths as long as you can stack more GPUs
- Venom: Vectorized N:M Format for sparse tensor cores
- MegaBlocks: Efficient sparse training with mixture of experts
- ReLU Strikes Back: Activation sparsity in LLMs
- Sparse Llama
- Simple pruning for LLMs
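
A few of the ideas above click faster with a few lines of code, so here are some minimal sketches. They are illustrative, not the papers' implementations, and any names or shapes in them are my own choices. First, the online softmax "trick" from the online normalizer paper: the running max and the normalizer are maintained together in a single pass over the scores, which is the building block Flash Attention tiles around.

```python
import torch

def online_softmax(scores: torch.Tensor) -> torch.Tensor:
    """Compute the softmax normalizer in one pass by keeping a running max `m`
    and a running sum `d`, rescaling `d` whenever the max changes."""
    m = torch.tensor(float("-inf"))  # running max
    d = torch.tensor(0.0)            # running sum of exp(x - m)
    for x in scores:
        m_new = torch.maximum(m, x)
        # rescale the old sum to the new max, then add the new term
        d = d * torch.exp(m - m_new) + torch.exp(x - m_new)
        m = m_new
    return torch.exp(scores - m) / d

x = torch.randn(8)
assert torch.allclose(online_softmax(x), torch.softmax(x, dim=0), atol=1e-6)
```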
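Next, the KV cache idea from the Efficiently Scaling Transformer Inference bullet, reduced to a single attention head: keys and values for past tokens are stored and reused, so each decode step only does work proportional to one new token.

```python
import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """One autoregressive decode step with a KV cache (single head).
    q_new, k_new, v_new: (1, d) projections for the newest token.
    k_cache, v_cache: (t, d) keys/values for all previous tokens."""
    k_cache = torch.cat([k_cache, k_new], dim=0)   # (t+1, d)
    v_cache = torch.cat([v_cache, v_new], dim=0)   # (t+1, d)
    # attention is computed for the new token only: (1, t+1) scores over the cache
    scores = (q_new @ k_cache.T) / k_cache.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ v_cache  # (1, d)
    return out, k_cache, v_cache

d = 16
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(4):  # decode four tokens, growing the cache each step
    q, k, v = torch.randn(1, d), torch.randn(1, d), torch.randn(1, d)
    out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
```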
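The speculative decoding loop, greatly simplified: a cheap draft model proposes several tokens sequentially, and the large target model verifies them all in one prefill-style forward pass. The greedy accept rule below is a simplification of the paper's rejection-sampling scheme, and the models here are just stand-in callables.

```python
import torch

def speculative_step(draft_model, target_model, prefix, k=4):
    """Propose k tokens with the draft model, then verify them with a single
    forward pass of the target model (greedy variant).
    Both models are callables: tokens (1, t) -> logits (1, t, vocab)."""
    # 1) draft k tokens autoregressively (cheap but sequential)
    draft = prefix
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)
    # 2) verify all k proposals with one target forward pass (parallel, prefill-like)
    target_pred = target_model(draft)[:, :-1].argmax(dim=-1)  # prediction for each next token
    accepted = prefix
    for i in range(prefix.shape[1], draft.shape[1]):
        proposed, verified = draft[:, i], target_pred[:, i - 1]
        if torch.equal(proposed, verified):
            accepted = torch.cat([accepted, proposed[:, None]], dim=-1)
        else:
            # first mismatch: keep the target model's token and stop
            accepted = torch.cat([accepted, verified[:, None]], dim=-1)
            break
    return accepted

vocab = 100
dummy = lambda toks: torch.randn(1, toks.shape[1], vocab)  # placeholder models
out = speculative_step(dummy, dummy, torch.randint(0, vocab, (1, 5)))
```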
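Grouped-query attention in a nutshell: many query heads share a smaller number of KV heads, so the KV cache shrinks by the group factor. Shapes below are illustrative and the causal mask is omitted.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d),
    with n_q_heads a multiple of n_kv_heads. Each group of query heads
    attends to the same KV head, so only n_kv_heads of K/V are cached."""
    b, n_q, s, d = q.shape
    group = n_q // k.shape[1]
    # replicate each KV head across its group of query heads
    k = k.repeat_interleave(group, dim=1)  # (b, n_q, s, d)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(1, 8, 32, 64)   # 8 query heads
k = torch.randn(1, 2, 32, 64)   # only 2 KV heads are cached
v = torch.randn(1, 2, 32, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 32, 64)
```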
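The block-table idea behind PagedAttention, as a toy data structure (not vLLM's code): the KV cache lives in fixed-size physical blocks from a shared pool, and each sequence maps logical token positions to blocks on demand instead of reserving memory for its maximum length.

```python
BLOCK_SIZE = 16  # tokens per physical KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool shared by all sequences

    def allocate(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # only grab a new physical block when the current one is full
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # where token `pos`'s K/V entries actually live
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table), seq.physical_slot(39))  # 3 blocks, (block_id, 7)
```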
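The core operation the quantization white paper builds on, shown as a symmetric per-tensor absmax int8 quantize/dequantize round trip. This is the generic scheme, not any one paper's recipe.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Symmetric per-tensor absmax quantization to int8."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())  # rounding error is bounded by roughly scale / 2
```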
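The SmoothQuant rebalancing mentioned above: a per-channel scale migrates quantization difficulty from activation outliers into the weights while leaving the matmul mathematically unchanged, since X W = (X diag(s)^-1)(diag(s) W). The alpha knob follows the paper's description; the code itself is illustrative.

```python
import torch

def smooth_scales(x_absmax, w_absmax, alpha=0.5):
    """Per-input-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return x_absmax.pow(alpha) / w_absmax.pow(1 - alpha)

# X: (tokens, in_features) activations, W: (in_features, out_features) weights
X = torch.randn(128, 64)
X[:, 3] *= 50.0                      # a typical activation outlier channel
W = torch.randn(64, 32)

s = smooth_scales(X.abs().amax(dim=0), W.abs().amax(dim=1))
X_smooth = X / s                     # activations become easier to quantize
W_smooth = W * s[:, None]            # weights absorb the difficulty

# the product is unchanged, so smoothing is a free preprocessing step
assert torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3)
```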
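Finally, a compact sketch of rotary position embeddings from the RoFormer bullet: pairs of feature dimensions are rotated by a position-dependent angle, so relative position shows up as a phase difference in the query-key dot product. The even/odd pairing below is one common convention, not necessarily the layout a given codebase uses.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to x of shape (seq, dim), dim even. Each (even, odd)
    feature pair is rotated by angle pos * theta_i, where the theta_i
    form a geometric series of frequencies."""
    seq, dim = x.shape
    theta = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    pos = torch.arange(seq, dtype=torch.float32)[:, None]                  # (seq, 1)
    angles = pos * theta                                                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin   # 2D rotation applied per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

q = rotary_embed(torch.randn(16, 64))
k = rotary_embed(torch.randn(16, 64))
# the position-dependent part of q[i] . k[j] now depends only on i - j
```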