Meta-llama

Complete implementation of Llama2 with/without KV cache & inference 🚀

Primary language: Python · License: MIT

LLaMA

Papers 📄

I am reading these papers (✅ read, ☑️ in progress):
✅ LLaMA: Open and Efficient Foundation Language Models
✅ Llama 2: Open Foundation and Fine-Tuned Chat Models
☑️ OPT: Open Pre-trained Transformer Language Models
✅ Attention Is All You Need
✅ Root Mean Square Layer Normalization
✅ GLU Variants Improve Transformer
✅ RoFormer: Enhanced Transformer with Rotary Position Embedding
✅ Self-Attention with Relative Position Representations
☑️ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
☑️ To Fold or Not to Fold: a Necessary and Sufficient Condition on Batch-Normalization Layers Folding
✅ Fast Transformer Decoding: One Write-Head is All You Need
✅ GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
☑️ PaLM: Scaling Language Modeling with Pathways

Goals 🚀

✅ Understand the concept of the dot product of two matrices.
✅ Understand the concept of autoregressive language models.
✅ Understand the concept of attention computation.
✅ Understand the workings of the Byte-Pair Encoding (BPE) algorithm and tokenizer.
✅ Read about and work with the SentencePiece library and tokenizer.
✅ Understand the concepts of tokenization, input IDs, and embedding vectors.
✅ Understand & implement the concept of positional encoding.
✅ Understand the concept of single-head self-attention.
✅ Understand the concept of scaled dot-product attention.
✅ Understand & implement the concept of multi-head attention.
✅ Understand & implement the concept of layer normalization.
✅ Understand the concept of masked multi-head attention & the softmax layer.
✅ Understand and implement RMSNorm and its difference from LayerNorm (see the sketch after this list).
✅ Understand the concept of internal covariate shift.
✅ Understand the concept and implementation of a feed-forward network with ReLU activation.
✅ Understand the concept and implementation of a feed-forward network with SwiGLU activation (sketch below).
✅ Understand the concept of absolute positional encoding.
✅ Understand the concept of relative positional encoding.
✅ Understand and implement rotary positional embeddings (sketch below).
✅ Understand and implement the transformer architecture.
✅ Understand and implement the original LLaMA (v1) architecture.
✅ Understand the concept of multi-query attention with a single KV projection.
✅ Understand and implement grouped query attention from scratch (sketch below).
✅ Understand and implement the concept of the KV cache (sketch below).
✅ Understand and implement the Llama2 architecture.
✅ Test the Llama2 implementation using the checkpoints from Meta.
✅ Download the Llama2 checkpoints and inspect the inference code.
☑️ Document the Llama2 implementation and the repo.
✅ Implement enabling and disabling of the KV cache.
✅ Add the attention mask when the KV cache is disabled in Llama2.
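
The snippets below are minimal PyTorch sketches of the building blocks referenced in this checklist. They are illustrative only, with assumed names, shapes, and hyperparameters, not the exact code in this repo. First, RMSNorm: unlike LayerNorm it subtracts no mean and adds no bias; it only rescales by the root mean square of the features.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm sketch (names and eps value assumed, not the repo's exact code)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain; no bias, unlike LayerNorm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by 1 / RMS(x) along the feature dimension; no mean subtraction.
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms_inv * self.weight
```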
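
The Llama feed-forward block uses SwiGLU instead of ReLU: a SiLU-gated projection multiplied element-wise with a second projection, then projected back down. A sketch (the w1/w2/w3 naming and hidden_dim choice are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Sketch of a SwiGLU feed-forward block: SwiGLU(x) = W2( SiLU(W1 x) * (W3 x) )."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```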
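
Rotary positional embeddings rotate each consecutive pair of query/key dimensions by an angle that depends on the token position. A sketch in the complex-number formulation (function names and the base theta = 10000 are assumptions):

```python
import torch

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # One complex unit rotation per (position, frequency) pair: shape (seq_len, head_dim // 2).
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, freqs)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); view consecutive dimension pairs as complex numbers.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs_cis = freqs_cis[: x.shape[1]].view(1, x.shape[1], 1, -1)
    rotated = torch.view_as_real(x_complex * freqs_cis).flatten(-2)
    return rotated.type_as(x)
```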
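
Grouped query attention shares each key/value head across a group of query heads. One simple way to implement it is to repeat the KV heads up to the number of query heads and then run ordinary scaled dot-product attention. A sketch (causal mask omitted for brevity; names assumed):

```python
import torch
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, seq, n_kv_heads, head_dim) -> (batch, seq, n_kv_heads * n_rep, head_dim)
    if n_rep == 1:
        return x
    b, s, n_kv, d = x.shape
    return x[:, :, :, None, :].expand(b, s, n_kv, n_rep, d).reshape(b, s, n_kv * n_rep, d)

def grouped_query_attention(q, k, v, n_heads: int, n_kv_heads: int) -> torch.Tensor:
    # q: (batch, seq, n_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    k = repeat_kv(k, n_heads // n_kv_heads)
    v = repeat_kv(v, n_heads // n_kv_heads)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))       # (batch, heads, seq, head_dim)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    attn = F.softmax(scores, dim=-1)                        # causal mask omitted for brevity
    return (attn @ v).transpose(1, 2)                       # (batch, seq, heads, head_dim)
```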
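
Finally, the KV cache stores the keys and values of already-processed tokens so each decoding step only has to project the newest token. A sketch of the bookkeeping, assuming a preallocated fixed-size cache per attention layer (class and argument names are assumptions):

```python
import torch

class KVCache:
    """Sketch of a per-layer KV cache with preallocated buffers (shapes assumed)."""
    def __init__(self, max_batch: int, max_seq_len: int, n_kv_heads: int, head_dim: int):
        self.cache_k = torch.zeros(max_batch, max_seq_len, n_kv_heads, head_dim)
        self.cache_v = torch.zeros(max_batch, max_seq_len, n_kv_heads, head_dim)

    def update(self, start_pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
        # Write the new tokens' keys/values at positions [start_pos, start_pos + t)
        # and return everything cached so far for the attention computation.
        b, t = k_new.shape[:2]
        self.cache_k[:b, start_pos:start_pos + t] = k_new
        self.cache_v[:b, start_pos:start_pos + t] = v_new
        return self.cache_k[:b, :start_pos + t], self.cache_v[:b, :start_pos + t]
```

With the cache enabled, generation feeds one token per step and attends over the cached keys/values; with it disabled, the whole prefix is re-encoded every step, which is why the causal attention mask has to be added back in that path.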

Blog Posts:

✅ LLAMA: OPEN AND EFFICIENT LLM NOTES
✅ UNDERSTANDING KV CACHE
✅ GROUPED QUERY ATTENTION (GQA)

Related GitHub Repositories:

🌐 pytorch-llama - PyTorch implementation of LLaMA by Umar Jamil.
🌐 pytorch-transformer - PyTorch implementation of the Transformer by Umar Jamil.
🌐 llama - Facebook's official LLaMA implementation.
🌐 tensor2tensor - Google's Transformer implementation.
🌐 rmsnorm - RMSNorm implementation.
🌐 roformer - Rotary Transformer (RoFormer) implementation.
🌐 xformers - Facebook's library of optimized Transformer building blocks.

Articles:

✅ Understanding SentencePiece ([Under][Standing][_Sentence][Piece])
✅ SwiGLU: GLU Variants Improve Transformer (2020)