Implementation of Llama-3 Model from Scratch

Introduction

This Jupyter Notebook presents an implementation of the Llama-3 language model from scratch, working step by step through the tensor and matrix operations that make up a forward pass. The implementation uses the pre-trained weights provided by Meta for the Llama-3 model. Before running the code, download the weights from the official link: https://llama.meta.com/llama-downloads/.

Tokenizer

The implementation uses the tokenizer provided by Meta for the Llama-3 model. The tokenizer is responsible for converting the input text into a sequence of tokens, which are numerical representations of words or subword units.
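
The tokenizer can be rebuilt directly from the downloaded BPE ranks. The following is a minimal sketch, assuming the download provides a tokenizer.model file of tiktoken-style BPE ranks under a Meta-Llama-3-8B/ directory; the pre-tokenization regex shown is a simplified stand-in, not Meta's exact pattern.

```python
import tiktoken
from tiktoken.load import load_tiktoken_bpe

tokenizer_path = "Meta-Llama-3-8B/tokenizer.model"  # assumed download location
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)

# Two of the special tokens used by Llama-3; their ids start right after the BPE ranks.
special_tokens = {
    "<|begin_of_text|>": len(mergeable_ranks),
    "<|end_of_text|>": len(mergeable_ranks) + 1,
}

tokenizer = tiktoken.Encoding(
    name="llama3-sketch",
    pat_str=r"\s*\S+|\s+",  # simplified pre-tokenization pattern, for illustration only
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

print(tokenizer.decode(tokenizer.encode("hello world!")))
```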

Loading Model Parameters

Instead of instantiating a model class and loading the weights into corresponding variables, this implementation loads the model parameters directly from the pre-trained weights file, one tensor at a time. This approach allows for a more granular understanding of the model's architecture and the role of each tensor in the computation.
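
A minimal sketch of this, assuming the single-file consolidated.00.pth checkpoint shipped with Meta-Llama-3-8B and the parameter names used in Meta's consolidated format (for example tok_embeddings.weight):

```python
import torch

# The checkpoint is a plain dict mapping parameter names to tensors.
model = torch.load("Meta-Llama-3-8B/consolidated.00.pth", map_location="cpu")

print(list(model.keys())[:10])               # e.g. "tok_embeddings.weight", "layers.0.attention.wq.weight", ...
print(model["tok_embeddings.weight"].shape)  # [vocab_size, dim]
```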

Analyzing Model Configuration

The loaded configuration dictionary provides insights into the architecture of the Llama-3 model, such as the number of transformer layers, attention heads, vocabulary size, and other hyperparameters. These configuration values help in understanding the model's capacity, complexity, and the specific architectural choices made during its design and training.
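
A sketch of reading that configuration, assuming the params.json file that accompanies the checkpoint; the values quoted in the comment are those typically reported for the 8B model and are illustrative.

```python
import json

with open("Meta-Llama-3-8B/params.json") as f:
    config = json.load(f)

print(config)
# Typical fields for the 8B checkpoint: dim (4096), n_layers (32), n_heads (32),
# n_kv_heads (8), vocab_size (128256), norm_eps (1e-5), rope_theta (500000.0)
```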

Converting Text to Tokens

The input text is converted into a sequence of tokens using the tiktoken library, a byte-pair-encoding tokenizer library developed by OpenAI; Meta distributes the Llama-3 vocabulary in tiktoken's format. The tokenizer breaks the text into individual tokens, which are represented as numerical ids.
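
Continuing the tokenizer sketch above, a prompt can be turned into token ids and wrapped in a tensor (the prompt string itself is arbitrary):

```python
import torch

prompt = "the answer to the ultimate question of life, the universe, and everything is "
tokens = [special_tokens["<|begin_of_text|>"]] + tokenizer.encode(prompt)
print(tokens)

tokens = torch.tensor(tokens)  # [seq_len] of int64 token ids
```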

Embedding Tokens

The token sequence is converted into a sequence of embeddings using the nn.Embedding module from PyTorch. Each token is mapped to a dense vector of length 4096 (the model dimension of the 8B checkpoint), capturing the learned semantic information associated with that token.
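
A sketch of the lookup, reusing the checkpoint dict (model), the configuration (config), and the tokens tensor from the sketches above:

```python
import torch

embedding_layer = torch.nn.Embedding(config["vocab_size"], config["dim"])
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])

# [seq_len] -> [seq_len, 4096]
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
print(token_embeddings_unnormalized.shape)
```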

RMS Normalization

The token embeddings undergo RMS (Root Mean Square) normalization before entering each sub-layer. Each embedding is scaled by the reciprocal of its root-mean-square and then by a learned per-feature weight, which keeps activations at a consistent scale and avoids numerical instabilities in subsequent computations.
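
A minimal RMSNorm helper matching this description:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # mean of squares over the feature dimension; eps keeps the rsqrt finite
    rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
    return x * rms * weight
```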

Building the Initial Transformer Layer

The implementation focuses on constructing the initial transformer layer, which consists of several components such as normalization, attention, and feed-forward networks.

Normalization

The first step in building the transformer layer is to normalize the token embeddings using the weights from the pre-trained model.
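
Applied to layer 0 it looks as follows, assuming the checkpoint key layers.0.attention_norm.weight used in Meta's consolidated format:

```python
token_embeddings = rms_norm(
    token_embeddings_unnormalized,
    model["layers.0.attention_norm.weight"],
    eps=config["norm_eps"],
)
print(token_embeddings.shape)  # [seq_len, 4096]
```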

Implementing Attention from Scratch

The attention mechanism is implemented from scratch, starting with separating the query, key, and value vectors for each attention head. The attention scores are computed by performing matrix multiplications between the query and key vectors, followed by scaling and softmax operations.
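
A sketch of this for head 0 of layer 0, with the weight key names assumed from Meta's consolidated checkpoint layout:

```python
head_dim = config["dim"] // config["n_heads"]  # 4096 / 32 = 128

# Query weights cover all 32 heads; key weights cover the 8 shared key/value heads.
q_per_head = model["layers.0.attention.wq.weight"].view(config["n_heads"], head_dim, config["dim"])
k_per_head = model["layers.0.attention.wk.weight"].view(config["n_kv_heads"], head_dim, config["dim"])

# Head 0: project every normalized token embedding into query/key space.
q0 = token_embeddings @ q_per_head[0].T   # [seq_len, head_dim]
k0 = token_embeddings @ k_per_head[0].T   # [seq_len, head_dim]

# Scaled dot-product scores between every pair of positions (before RoPE and masking).
scores = (q0 @ k0.T) / head_dim ** 0.5    # [seq_len, seq_len]
```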

Positional Encoding

To incorporate positional information into the attention mechanism, rotary positional embeddings (RoPE) are utilized. The query and key vectors are rotated based on their positions in the input sequence, allowing the model to capture the relative positions of tokens.
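
A minimal RoPE sketch: consecutive pairs of query/key values are viewed as complex numbers and multiplied by position-dependent unit phasors whose frequencies are derived from rope_theta (500000.0 in the configuration above).

```python
import torch

def apply_rope(x: torch.Tensor, theta: float) -> torch.Tensor:
    seq_len, head_dim = x.shape
    # one rotation frequency per pair of values
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)    # [seq_len, head_dim // 2]
    phasors = torch.polar(torch.ones_like(angles), angles)        # complex64 unit phasors
    x_pairs = torch.view_as_complex(x.float().reshape(seq_len, -1, 2))
    x_rotated = torch.view_as_real(x_pairs * phasors).reshape(seq_len, head_dim)
    return x_rotated.type_as(x)

q0_rot = apply_rope(q0, config["rope_theta"])
k0_rot = apply_rope(k0, config["rope_theta"])
scores = (q0_rot @ k0_rot.T) / head_dim ** 0.5  # position-aware attention scores
```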

Masking and Softmax

Future token scores are masked to ensure that the model's predictions are based only on the preceding context. The masked attention scores undergo a softmax operation to obtain the final attention weights.
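
A sketch of the causal mask and softmax applied to the scores above:

```python
import torch

seq_len = scores.shape[0]
mask = torch.full((seq_len, seq_len), float("-inf"))
mask = torch.triu(mask, diagonal=1)  # upper triangle = future positions

attn_weights = torch.nn.functional.softmax((scores + mask).float(), dim=-1)
```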

Generating Value Vectors

Value vectors are computed by multiplying the token embeddings with the value weight matrix. These value vectors are then weighted by the attention scores to produce the attention output.
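
For head 0 of layer 0, again assuming Meta's key names:

```python
v_per_head = model["layers.0.attention.wv.weight"].view(config["n_kv_heads"], head_dim, config["dim"])

v0 = token_embeddings @ v_per_head[0].T            # [seq_len, head_dim]
attention_head0 = attn_weights.to(v0.dtype) @ v0   # [seq_len, head_dim]
```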

Multi-Head Attention

The attention mechanism is applied across multiple heads, allowing the model to capture different aspects of the input simultaneously. The attention outputs from all heads are concatenated to form a single attention representation.
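
A sketch of the full multi-head pass for layer 0. Llama-3 uses grouped-query attention, so every four of the 32 query heads share one of the 8 key/value heads; the concatenated heads are projected back to the model dimension with the wo weight.

```python
import torch

per_head_outputs = []
for head in range(config["n_heads"]):
    kv_head = head // (config["n_heads"] // config["n_kv_heads"])  # 4 query heads per kv head

    q = apply_rope(token_embeddings @ q_per_head[head].T, config["rope_theta"])
    k = apply_rope(token_embeddings @ k_per_head[kv_head].T, config["rope_theta"])
    v = token_embeddings @ v_per_head[kv_head].T

    scores = (q @ k.T) / head_dim ** 0.5 + mask
    weights = torch.nn.functional.softmax(scores.float(), dim=-1).to(v.dtype)
    per_head_outputs.append(weights @ v)

stacked = torch.cat(per_head_outputs, dim=-1)                        # [seq_len, 4096]
attention_output = stacked @ model["layers.0.attention.wo.weight"].T
```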

Feed-Forward Network

After the attention layer, a position-wise feed-forward network is applied to introduce non-linearity and capture complex relationships between the input embeddings. The Llama-3 model utilizes a SwiGLU (Swish-Gated Linear Unit) activation function in the feed-forward network.
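
A sketch of layer 0's feed-forward block, assuming the w1/w2/w3 key names of Meta's consolidated checkpoint: the block computes w2(silu(w1 x) * w3 x) on the re-normalized residual stream.

```python
import torch

# Residual connection after attention, then the pre-FFN RMSNorm.
residual = token_embeddings_unnormalized + attention_output
ffn_input = rms_norm(residual, model["layers.0.ffn_norm.weight"], eps=config["norm_eps"])

w1 = model["layers.0.feed_forward.w1.weight"]
w2 = model["layers.0.feed_forward.w2.weight"]
w3 = model["layers.0.feed_forward.w3.weight"]

# SwiGLU: silu-gated linear unit followed by the down projection.
ffn_output = (torch.nn.functional.silu(ffn_input @ w1.T) * (ffn_input @ w3.T)) @ w2.T
layer0_output = residual + ffn_output  # [seq_len, 4096]
```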

Processing Subsequent Layers

The implementation extends the processing to the subsequent transformer layers, iterating over each layer and applying the attention and feed-forward networks sequentially. The embeddings are updated at each layer, incorporating increasingly complex interactions and dependencies.
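
The single-layer steps above can be repeated for all 32 layers. The following sketch reuses the rms_norm and apply_rope helpers and the causal mask defined earlier; key names are again assumed from Meta's consolidated format.

```python
import torch

final_embedding = token_embeddings_unnormalized
for layer in range(config["n_layers"]):
    # Attention block
    normed = rms_norm(final_embedding, model[f"layers.{layer}.attention_norm.weight"], eps=config["norm_eps"])
    qw = model[f"layers.{layer}.attention.wq.weight"].view(config["n_heads"], head_dim, config["dim"])
    kw = model[f"layers.{layer}.attention.wk.weight"].view(config["n_kv_heads"], head_dim, config["dim"])
    vw = model[f"layers.{layer}.attention.wv.weight"].view(config["n_kv_heads"], head_dim, config["dim"])

    heads = []
    for head in range(config["n_heads"]):
        kv_head = head // (config["n_heads"] // config["n_kv_heads"])
        q = apply_rope(normed @ qw[head].T, config["rope_theta"])
        k = apply_rope(normed @ kw[kv_head].T, config["rope_theta"])
        v = normed @ vw[kv_head].T
        weights = torch.nn.functional.softmax(((q @ k.T) / head_dim ** 0.5 + mask).float(), dim=-1).to(v.dtype)
        heads.append(weights @ v)

    attn_out = torch.cat(heads, dim=-1) @ model[f"layers.{layer}.attention.wo.weight"].T
    residual = final_embedding + attn_out

    # Feed-forward block
    ffn_in = rms_norm(residual, model[f"layers.{layer}.ffn_norm.weight"], eps=config["norm_eps"])
    w1 = model[f"layers.{layer}.feed_forward.w1.weight"]
    w2 = model[f"layers.{layer}.feed_forward.w2.weight"]
    w3 = model[f"layers.{layer}.feed_forward.w3.weight"]
    final_embedding = residual + (torch.nn.functional.silu(ffn_in @ w1.T) * (ffn_in @ w3.T)) @ w2.T
```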

Decoding the Final Embedding

The final embedding obtained after processing all the transformer layers represents the model's prediction for the next token. This embedding is decoded into an actual token id by projecting it through the model's output (unembedding) matrix.
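
A sketch of the final step, assuming the norm.weight and output.weight keys of Meta's consolidated checkpoint:

```python
import torch

# Final RMSNorm, then project the last position through the output (unembedding) matrix.
final_embedding = rms_norm(final_embedding, model["norm.weight"], eps=config["norm_eps"])
logits = final_embedding[-1] @ model["output.weight"].T  # [vocab_size]

next_token = torch.argmax(logits, dim=-1)
print(tokenizer.decode([next_token.item()]))
```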

Conclusion

This Notebook provides an overview of the step-by-step implementation of the Llama-3 language model from scratch, focusing on key components such as tokenization, embedding, attention, positional encoding, and feed-forward networks. By implementing the model from the ground up, this project aims to deepen the understanding of the inner workings of the Llama-3 architecture and the transformations applied to the input data at each stage.

The implementation showcases the power and flexibility of the transformer architecture in capturing complex language patterns and generating meaningful predictions. It serves as a valuable resource for those interested in understanding the intricacies of language modeling and exploring the capabilities of the Llama-3 model.