
Study resources, code, and personal notes on the theory and application of Large Language Models


LLM-FastTrack

This is where I'm keeping track of everything I learn about Large Language Models. It's straightforward – notes, code, and links to useful resources.

What's Inside

  • Notes: Quick thoughts, summaries, and explanations I've written down to better understand LLM concepts.
  • Code: The actual code I've written while experimenting with LLMs. It's not always pretty, but it works (mostly).
  • Resources: Links to articles, papers, and tutorials that have cleared things up for me. No fluff, just the good stuff.

My Notebooks

This is where I keep my experiments and code snippets.

Why This Repo

I needed somewhere to dump my brain as I dive into LLMs. Maybe it'll help someone else, maybe not. But it's helping me keep track of my progress and organize my thoughts.

Feel free to look around if you're into LLMs or just curious about what I'm learning. No promises, but you might find something useful.


Study Plan

This is the curriculum I'm following to learn about Large Language Models. It's a mix of PyTorch basics, LLM concepts, and real-world applications. The first draft of the study plan was generated by an LLM, and I'll be updating it as I go along.

Resources:

| Category | Title + Link | Comment |
| --- | --- | --- |
| Study | How I got into deep learning | Vikas Paruchuri's journey into deep learning and AI. |

1. Getting Good with PyTorch


My Study Notes

Most of my notes will be in the form of notebooks, and I will link them in each section. I will also write a short summary of the key points I've learned in each section.

Before getting started

At the moment I prefer to use PyCharm Pro as my dev environment. The benefits are venv and notebook support plus full IDE support (with Copilot). If you want to run any of my code, you need to set up and activate a virtual environment and install the required packages with:

pip install -r requirements.txt

Alternatively, follow these installation guides:

PyTorch

I am a software engineer and already know how to code, but I am new to the PyTorch library and want to get familiar and fluent writing code with it before I dive deeper into LLMs. If you don't know how to program, I would recommend taking at least a short introductory course in Python before continuing.

If you look at the tools and libraries used to build neural networks, you'll quickly discover that there are many choices. You will also see that PyTorch is one of the most popular and fastest-growing libraries, so that is the library I picked as a starting point. For now I am not going to worry about the other choices or whether I need to know them; I'll focus on PyTorch and expand later when I need to.

Vikas Paruchuri said this about proficiency: "You should get to a point where you can code up any of the main neural networks architectures in plain numpy". Since PyTorch tensors are very similar to NumPy arrays, this will be my goal. Now let's get good with PyTorch tensors.

1.1. PyTorch Basics

1.1.1. PyTorch Tensors vs Numpy (Arrays)

I did a bit of searching to find out how PyTorch tensors and NumPy arrays differ and how they are similar. Here is what I found:

PyTorch tensors and NumPy arrays are both powerful tools widely used in the field of data science and machine learning, especially for array computing and handling large datasets. Despite their similarities, there are fundamental differences, especially in how they are used within the deep learning context.

Similarities between PyTorch Tensors and NumPy Arrays

  1. Data Structure: Both PyTorch tensors and NumPy arrays provide efficient data structures for storing and manipulating numerical data in multi-dimensional arrays. They offer a wide range of functionalities for array manipulations such as reshaping, slicing, and broadcasting.

  2. API Overlap and Interoperability: There is a significant overlap in the APIs between PyTorch and NumPy, making it relatively easy for users to switch between the two or to integrate them within the same project. PyTorch tensors can be easily converted to and from NumPy arrays, allowing for seamless integration between the two libraries. Functions for operations like addition, multiplication, transposition, and more, have similar calling conventions.

  3. Memory Sharing: PyTorch can interoperate with NumPy through memory sharing. Tensors can be converted to NumPy arrays and vice versa without necessarily copying data. This allows for efficient memory usage when transitioning between the two during preprocessing or analysis stages.
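
To make the memory-sharing point concrete, here is a minimal sketch (it assumes torch and numpy are installed) showing that `torch.from_numpy` and `Tensor.numpy()` share the same underlying buffer:

```python
import numpy as np
import torch

# NumPy array -> PyTorch tensor (shares memory, no copy)
a = np.ones(3)
t = torch.from_numpy(a)

# Modifying the tensor in place is visible in the NumPy array
t.add_(1)
print(a)          # [2. 2. 2.]

# A CPU tensor's .numpy() view also shares memory
b = t.numpy()
a[0] = 5
print(b)          # [5. 2. 2.]
```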

Differences between PyTorch Tensors and NumPy Arrays

  1. Computation Graphs and Backpropagation: PyTorch tensors are integrated with a powerful automatic differentiation library, Autograd. This makes them suitable for building neural networks where gradients are computed for optimization. NumPy, on the other hand, does not support automatic differentiation and is typically used for more straightforward numerical computations without the need for tracking gradients. (See the sketch after this list for a minimal example.)

  2. GPU Support: PyTorch tensors are designed to easily switch between CPU and GPU operations, which is crucial for training deep learning models efficiently. NumPy primarily operates on the CPU, meaning operations using NumPy arrays do not benefit from GPU acceleration.

  3. In-place Operations: PyTorch offers explicit in-place operations (tensor methods ending in an underscore, such as add_) that modify a tensor's underlying data without creating a new tensor. NumPy operations typically return a new array unless an in-place form (such as the out= argument or augmented assignment) is used.

  4. Designed for Deep Learning: PyTorch is inherently designed for deep learning applications. It provides functionalities like tensor operations on GPUs, distributed computing, and more, which are specifically tailored for training neural networks. NumPy, while versatile in handling numerical data, lacks these deep learning-specific enhancements.

  5. Dynamic vs Static Computing: PyTorch allows for dynamic computational graphs, meaning the graph is built at runtime. This is beneficial for models where the computation cannot be completely described as a static graph beforehand. NumPy’s usage scenario doesn’t involve computational graphs and is purely for static array computations.
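
Here is the minimal sketch mentioned above, covering the first two differences (autograd and GPU support). It assumes torch is installed, and the GPU move is guarded so it also runs on CPU-only machines:

```python
import torch

# Autograd: track operations on a tensor and compute gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()          # y = x1^2 + x2^2
y.backward()                # dy/dx = 2x
print(x.grad)               # tensor([4., 6.])

# GPU support: move a tensor to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
z = torch.randn(2, 2).to(device)
print(z.device)
```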

Use Cases

NumPy is excellent for tasks that require straightforward numerical computation in science and engineering but do not need gradients or massive parallelism offered by GPUs. PyTorch is preferable when developing complex models that require gradients, need to run on GPUs for performance, or when the models involve dynamic changes in the computation process.

Summary:

While PyTorch tensors and NumPy arrays share many similarities in terms of their core functionality as n-dimensional arrays, PyTorch tensors are specifically designed for deep learning and machine learning applications, with features like automatic differentiation and GPU support, which make them more suitable for these tasks compared to the more general-purpose NumPy arrays.

Conclusion:

Since we are going to get good with LLMs, PyTorch sounds just like what we need. Let's get into it in the next section.

PyTorch Tensors

I created a Jupyter Notebook 001-pytorch-tensors.ipynb that contains all of my basic experiments with PyTorch tensors.

Study Notes

I like to keep my notes in a question answering format because it helps with retrieval and interview preparation at the same time.

| Question | Answer |
| --- | --- |
| What is a tensor? | A tensor is a multi-dimensional array used for numerical computations. (It is very similar to a NumPy array or a TensorFlow tensor.) |
| What is a tensor with rank 0? | A 0-dimensional tensor is a scalar that represents a single numerical value. |
| What is a tensor with rank 1? | A 1-dimensional tensor is a vector that represents a list of numerical values. |
| What is a tensor with rank 2? | A 2-dimensional tensor is a matrix that represents a table of numerical values. |
| What is broadcasting? | Broadcasting is a technique in PyTorch that allows element-wise operations between tensors of different shapes and sizes, without manually reshaping or duplicating data. |
| When is a PyTorch tensor "broadcastable"? | Rule 1: Each tensor has at least one dimension. Rule 2: When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist. |
| Why does the choice of data type for a tensor matter? | Choosing the right one is important because it influences memory usage and performance. |
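
A small sketch of the broadcasting rules from the table above (assuming torch is installed; the shapes are chosen only to illustrate the rules):

```python
import torch

# Shapes (3, 1) and (2,): trailing dims are 1 vs 2 (one is 1 -> OK),
# next dim is 3 vs missing (does not exist -> OK), so they broadcast to (3, 2)
a = torch.arange(3).reshape(3, 1)   # shape (3, 1)
b = torch.tensor([10, 20])          # shape (2,)
print((a + b).shape)                # torch.Size([3, 2])

# Shapes (3, 2) and (4, 2) violate rule 2 (3 vs 4, neither is 1) -> error
try:
    torch.zeros(3, 2) + torch.zeros(4, 2)
except RuntimeError as e:
    print("not broadcastable:", e)
```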

Study Resources

| Category | Title | Comment |
| --- | --- | --- |
| Coding | 4.3 Vectors, Matrices, and Broadcasting | A YouTube video by Sebastian Raschka |
| Coding | Broadcasting Semantics in PyTorch | Explains how/when broadcasting happens |
| Coding | PyTorch Tensor Basics | Basic tensor operations in PyTorch with explanations |

II. Learning about LLMs

  • Transformer models
  • Architecture of Transformer Models: Attention mechanisms, multi-head attention, positional encoding, feed-forward networks.
  • Pre-trained Models Overview: GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and variants (RoBERTa, T5, etc.).
  • Tokenization and Embeddings: WordPiece, SentencePiece, BPE (Byte Pair Encoding), contextual embeddings.
  • Language Modeling: Unsupervised learning, predicting the next word, understanding context.
  • Evaluation Metrics: Perplexity, BLEU score, ROUGE, F1 score, accuracy, precision, recall.
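
Since perplexity will come up again and again in this plan, here is a minimal sketch of how it is computed from token logits. The logits and targets below are made up for illustration; with a real model they would come from the model's output and the reference text:

```python
import torch
import torch.nn.functional as F

# Made-up logits for a 5-token sequence over a 10-word vocabulary
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))

# Perplexity = exp(average negative log-likelihood of the target tokens)
nll = F.cross_entropy(logits, targets)   # mean NLL over the sequence
perplexity = torch.exp(nll)
print(perplexity.item())
```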

Basics of Transformers

Q: What are homogenized models, and why are transformers homogenized models?
A: Homogenized models are designed to be highly adaptable across a wide range of tasks without needing specific task-oriented tuning. Transformers are considered homogenized models because they use the same model architecture to perform various NLP tasks effectively, leveraging their ability to process sequences of data in parallel and understand context without task-specific adjustments.

Q: What are foundation models in the context of transformers?
A: A foundation model is a model that has been trained on billions of records and has billions of parameters. These models can then perform a wide range of tasks without any further fine-tuning.

Q: When should one use traditional NLP methods, and when are transformers the better choice for NLP tasks?
A: Traditional NLP methods are useful when working with smaller datasets or when computational resources are limited. They are also beneficial when the tasks require simpler models that can be more easily interpreted. Transformers are better when dealing with large datasets, require understanding of context, or when the tasks benefit from deeper, more complex patterns in the data.

Q: What are transformer models in the context of Industry 4.0?
A: In Industry 4.0, transformer models are used for automating complex decision-making processes by analyzing vast amounts of data from various sources such as sensors, machines, and production lines. They enhance predictive maintenance, quality control, and supply chain management through advanced NLP and machine learning techniques.

Q: Why do we say that a feature of transformers is a high-level of homogenization?
A: Transformers exhibit a high level of homogenization because they apply the same architecture to process various types of data across multiple tasks, enabling consistent performance and facilitating machine-to-machine connections in dynamic environments like Industry 4.0.

Q: What are some examples of foundation models?
A: Examples of foundation models include GPT-3 by OpenAI, Google's BERT, Facebook’s RoBERTa, and Microsoft’s Turing-NLG.

Q: Why can it be that some models do not reach the homogenization level of foundation models?
A: Some models may not achieve the homogenization level of foundation models due to limitations in training data diversity, computational resources, or insufficient training methodologies that prevent the models from generalizing well across different tasks.

Q: What is a stochastic model, and how does that relate to LLMs?
A: A stochastic model in the context of LLMs (large language models) like Codex refers to their probabilistic nature in generating outputs. This means they use randomness in their processes to generate varied results, which can be useful for tasks like code generation where multiple correct solutions can exist.

Q: What is a sequence model?
A: A sequence model is a type of AI model that processes sequences of data, such as sentences or time series, where the order of the input data is important. It learns to predict elements in the sequence, understand context, or generate new sequences based on learned patterns.

Q: In the context of NLP, what are Markov Chains and Markov (decision) processes, what are they used for?
A: In NLP, Markov Chains are used to model the probabilities of sequences of words or phrases, assuming that the probability of each item depends only on the previous item. Markov decision processes extend this concept into decision making, where transitions between states are decided not only based on the state but also the action taken, useful in conversational agents and other sequential decision-making tasks.

Q: What are RNNs good for or used for? Give examples.
A: RNNs (Recurrent Neural Networks) are particularly good for tasks where the order and context of the input data matter, such as text generation, speech recognition, and time series prediction. They excel in handling sequences where the current input depends on the previous one.

Q: Can CNNs be applied to text? How?
A: Yes, CNNs (Convolutional Neural Networks) can be applied to text by treating segments of words or characters as spatial dimensions, similar to how they treat regions in an image. This allows them to identify patterns like word groupings and sentence fragments, useful in tasks like sentiment analysis and topic classification.

Q: What is LeNet-5 from Yann LeCun, and why is it well known?
A: LeNet-5, developed by Yann LeCun, is one of the earliest convolutional neural networks that significantly influenced the development of deep learning. It was initially designed for digit recognition and is well-known for demonstrating the effectiveness of CNNs in practical applications, leading to the broader adoption of deep learning in many fields.

Q: Why can't CNNs deal well with long term dependencies in long and/or complex sequences of text?
A: CNNs struggle with long-term dependencies in text because their convolutional filters typically capture local patterns within a fixed window size, making it difficult to maintain contextual information over longer text sequences without extensive layering or large receptive fields, which can be computationally inefficient.

Q: Are there recurrences in transformer models? A: No, recurrence has been abandoned in the transformer architecture.

Q: What type of architecture is a transformer? A: An encoder-decoder architecture.

Q: What is the encoder, and what is the decoder responsible for in a transformer? A: The encoder is responsible for creating a rich context representation of the input text, and the decoder is responsible for using that to generate the next output based on the previous outputs.

Q: What replaced the recurrence functions of RNNs, LSTMs, CNNs in transformers? A: The attention mechanism has replaced the recurrence functions.

Q: How many layers did the encoder stack of the original transformer have (from the "Attention Is All You Need" paper)? A: The encoder consisted of 6 layers, each featuring a multi-head attention sublayer and a feed-forward sublayer, each followed by a normalization (add & norm) step.

Q: What is a difference between the encoder and decoder stack? A: The decoder stack features an additional masked multi head attention sublayer.

Q: What is multi-head attention? A: Transformers have multiple attention heads (8 in the original architecture) that can process the input in parallel.

Q: What do the attention mechanisms learn? A: Each attention mechanism learns different perspectives of the same input sequence.

Q: With what has recurrence been replaced in transformer models? A: The recurrence we know from RNNs and LSTMs has been replaced by the attention mechanism in transformers.

Q: Are the layers in the encoder stack of the original transformer identical? A: Yes, each layer consists of the same sublayers (a multi-head attention sublayer and a feed-forward sublayer); the input embedding combined with positional encoding is applied before the input enters the first layer.

Q: Why do we speak of self-attention when we talk about transformers? A: Because the queries, keys, and values are all derived from the same input sequence: each token attends to the other tokens of that same sequence.

Q: What is the motivation for the architecture of the transformer model? A: To allow an industrial approach to deep learning. For a start, it fits hardware optimization requirements well. For example, the stack structure of transformers allows for the design of domain-specific optimized hardware that requires less floating-point precision.

Q: What is a stack in the context of transformer architectures? A: A stack consists of n identical layers. A stack can either be an encoder or a decoder. A stack runs from the bottom (layer 1) to the top (layer n), and along the way each layer learns something that it passes on to the next layer, similar to how human memory works.

Q: What are sublayers? A: Each layer in a stack contains sublayers. The structure of the sublayers is the same across layers (great for hardware optimization). In the original transformer paper, the sublayers were a self-attention sublayer and a feedforward network sublayer, processed in that order. The self-attention sublayer was specifically designed for NLP and hardware optimizations.

Q: What are attention heads? A: Each self-attention sublayer is divided into n independent and identical layers called "heads". The original transformer architecture contained 8 heads in the self-attention sublayer of every layer. Each of the heads can be processed independently of each other, ideal for parallelization.
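
To make the attention questions above more concrete, here is a minimal sketch of scaled dot-product attention as described in the "Attention Is All You Need" paper (single head, no masking, no learned projection matrices; dimensions are arbitrary):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Self-attention: Q, K, V all come from the same input sequence
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)
q, k, v = x, x, x                 # identity "projections" for illustration
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                  # torch.Size([4, 8])
```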

Q: What is an autoregressive language model? A: An autoregressive language model is a type of artificial intelligence model that generates text by predicting one word at a time, based on the previous words in the sequence. This approach is called "autoregressive" because it uses its own previous outputs as inputs for future predictions.

Q: What is an example of an autoregressive language model? A: GPT-3 (Generative Pre-trained Transformer 3) by OpenAI is an example of an autoregressive language model that generates human-like text by predicting the next word in a sequence based on the context of the previous words.
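
A minimal sketch of what "using its own previous outputs as inputs" looks like in code: a greedy decoding loop. It assumes the Hugging Face transformers library is installed and uses the small gpt2 checkpoint, which is downloaded on first run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits                # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # feed the output back in as input

print(tokenizer.decode(ids[0]))
```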

Q: What is the difference between autoregressive and non-autoregressive language models? A: Autoregressive models generate text one token at a time, conditioning each prediction on the tokens generated so far. Non-autoregressive models predict all (or several) output tokens in parallel, which is faster at inference time but typically gives up some output quality.


Q: What is the MMLU Benchmark? A: The MMLU Benchmark (Massive Multitask Language Understanding) is a challenging test designed to measure a text model's multitask accuracy by evaluating models in zero-shot and few-shot settings. It serves as a standardized way to assess AI performance on tasks that range from simple math to complex legal reasoning. MMLU contains 57 tasks across topics including elementary mathematics, US history, computer science, and law, requiring models to demonstrate a broad knowledge base and problem-solving skills.

III. Mathematical Foundations

Foundational and advanced mathematical concepts that underpin the workings of Large Language Models (LLMs), especially those based on the Transformer architecture.

Chapter Overview

  1. Linear Algebra:
    • Vectors and Matrices: Understanding the basic building blocks of neural networks, including operations like addition, multiplication, and transformation.
    • Eigenvalues and Eigenvectors: Importance in understanding how neural networks learn and how data can be transformed.
    • Special Matrices: Identity matrices, diagonal matrices, and their properties relevant to neural network optimizations.

3.1. Linear Algebra

Study Notes

This is an overview and review of some of the basic concepts in linear algebra.

3.1.1. The Geometry of Linear Equations

Idea: We are looking for a solution to a system of linear equations. For that, we express the equations as row vectors in a matrix A. The solution to all equations is then the vector x in Ax = b.

There are different ways of looking at the matrix and the vectors involved in the Ax = b equation:

Row picture of a matrix

Linear combinations of column vectors

Column picture of a matrix

Matrix-vector multiplication

You can do it column-based: take 1 of the first column and add 2 of the second column.

You can do it row-based (dot products): the first entry of the result is the dot product of the first row of A with the vector, and the second entry is the dot product of the second row of A with the vector.
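
A small NumPy sketch of the two views of the same multiplication (the matrix and vector are made up; x = [1, 2] matches the "1 of the first column plus 2 of the second column" example above):

```python
import numpy as np

A = np.array([[2., 5.],
              [1., 3.]])
x = np.array([1., 2.])

# Column view: x1 * (first column) + x2 * (second column)
col_view = 1 * A[:, 0] + 2 * A[:, 1]

# Row view: dot product of each row of A with x
row_view = np.array([A[0] @ x, A[1] @ x])

print(col_view, row_view, A @ x)   # all three are the same vector
```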

| Question | Answer |
| --- | --- |
| Do the linear combinations of the columns fill n-dimensional space? | This is the same question as: does Ax = b always have a solution for x? |
| Is there always a solution for x in Ax = b? | Yes, if A is invertible (equivalently, if A is non-singular). |
| Are invertible matrices always non-singular matrices? | Yes. |
| What is the definition of a singular matrix? | A matrix is singular if it does not have an inverse. |
| What can you tell about a matrix if its determinant is zero? | The matrix has linearly dependent row or column vectors. |
| When is a matrix not invertible? | When it has linearly dependent row or column vectors. |
| What does the determinant tell us about a matrix? | When it is zero, the matrix is not invertible. When it is non-zero, the row and column vectors are linearly independent. |
| What is the definition of an invertible matrix? | A is invertible if A^-1^ exists such that A*A^-1^ = I. |
| What are some methods that can be used to find the inverse of a matrix? | a) Gaussian elimination (row reduction); b) matrix decomposition techniques: LU decomposition, QR decomposition, singular value decomposition (SVD). |
| When a matrix is invertible, how many solutions can exist for x in Ax = b? | x always has exactly one solution. |
| When a matrix is singular, how many solutions can exist for x in Ax = b? | x can have 0 or infinitely many solutions, but never exactly one. |
| How can Gaussian elimination fail? | It can fail primarily due to zero pivots that cannot be removed by row swaps. This often occurs when there is linear dependence among the rows, leading either to no solution (inconsistent system) or to infinitely many solutions (underdetermined system). |
| What does it mean when we find a zero pivot during Gaussian elimination (and no row exchange can fix it)? | That we have linearly dependent rows or columns, meaning there are either zero or infinitely many solutions to the system of equations. |
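
A short sketch tying the table above together: checking the determinant and solving Ax = b with NumPy (the matrices are chosen arbitrarily for illustration):

```python
import numpy as np

A = np.array([[2., 5.],
              [1., 3.]])
b = np.array([12., 7.])

print(np.linalg.det(A))        # 1.0 -> non-zero, so A is invertible
print(np.linalg.solve(A, b))   # exactly one solution: [1. 2.]

# A singular matrix (second row is a multiple of the first)
S = np.array([[1., 2.],
              [2., 4.]])
print(np.linalg.det(S))        # 0.0 -> no inverse; Sx = b has 0 or infinitely many solutions
```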

Study Resources

| Category | Title | Comment |
| --- | --- | --- |
| Math | MIT OCW Linear Algebra, Fall 2011 | by Prof. Gilbert Strang |
| Math | MIT OCW Lecture 1: The Geometry of Linear Equations | by Prof. Gilbert Strang |
| Math | MIT OCW Lecture 2: Elimination with Matrices | by Prof. Gilbert Strang |
| Math | MIT OCW Lecture 3: Multiplication and Inverse Matrices | by Prof. Gilbert Strang |

IV. Fine-Tuning and Optimising LLMs

  • Fine-Tuning Techniques: Transfer learning, learning rate adjustment, layer freezing/unfreezing, gradual unfreezing.
  • Optimization Algorithms: Adam, RMSprop, SGD, learning rate schedulers.
  • Regularization and Generalization: Dropout, weight decay, batch normalization, early stopping.
  • Efficiency and Scalability: Mixed precision training, model parallelism, data parallelism, distributed training.
  • Model Size Reduction: Quantization, pruning, knowledge distillation.

V. RAG: Retrieval-Augmented Generation

  • Introduction to RAG: Concept, architecture, comparison with traditional LLMs.
  • Retrieval Mechanisms: Dense Vector Retrieval, BM25, using external knowledge bases.
  • Integrating RAG with LLMs: Fine-tuning RAG models, customizing retrieval components.
  • Applications of RAG: Question answering, fact checking, content generation with external references.
  • Challenges and Solutions: Handling out-of-date knowledge, bias in retrieved documents, improving retrieval relevance.

VI. Developing real-world Applications with LLMs

  • Integrating LLMs into Applications: API development, deploying models with Flask/Django for web applications, mobile app integration.
  • User Interface and Experience: Chatbots, virtual assistants, generating human-like text, handling user inputs.
  • Security and Scalability: Authentication, authorization, load balancing, caching.
  • Monitoring and Maintenance: Logging, error handling, continuous integration and deployment (CI/CD) pipelines.
  • Case Studies and Project Ideas: Content generation, summarization, translation, sentiment analysis, automated customer service.

Terms and Concepts (uncategorized)

| Keyword | Explanation | Links |
| --- | --- | --- |
| Temperature | Affects the randomness of the model's output by scaling the logits before applying softmax, influencing the model's "creativity" or certainty in its predictions. Lower temperatures lead to more deterministic outputs, while higher temperatures increase diversity and creativity. | Peter Chng |
| Top P (Nucleus Sampling) | Selects a subset of likely outcomes by ensuring the cumulative probability exceeds a threshold p, allowing for adaptive and context-sensitive text generation. This method focuses on covering a certain amount of probability mass. | Peter Chng |
| Top K | Limits the selection pool to the K most probable next words, reducing randomness by excluding less likely predictions from consideration. This method normalizes the probabilities of the top K tokens to sample the next token. | Peter Chng |
| Q (Query) | Represents the input tokens being compared against key-value pairs in attention mechanisms, facilitating the model's focus on different parts of the input sequence for predictions. | |
| K (Key) | Represents the tokens used to compute the amount of attention that input tokens should pay to the corresponding values, crucial for determining focus areas in the model's attention mechanism. | |
| V (Value) | Is the content that is being attended to, enriched through the attention mechanism with information from the key, indicating the actual information the model focuses on during processing. | |
| Embeddings | High-dimensional representations of tokens that capture semantic meanings, allowing models to process words or tokens by encapsulating both syntactic and semantic information. | |
| Tokenizers | Tools that segment text into manageable pieces for processing by models, with different algorithms affecting model performance and output quality. | |
| Rankers | Algorithms used to order documents or predict their relevance to a query, influencing the selection of next words or sentences based on certain criteria in NLP applications. | |
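
To connect the first three terms in the table, here is a minimal sketch of temperature, top-k, and top-p applied to a single vector of made-up next-token logits (plain PyTorch, no real model):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])   # made-up next-token logits

# Temperature: scale logits before softmax (lower -> more deterministic output)
probs = torch.softmax(logits / 0.7, dim=-1)

# Top-k: keep only the k most probable tokens and renormalize
k = 3
topk_vals, topk_idx = probs.topk(k)
topk_probs = topk_vals / topk_vals.sum()

# Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= p
p = 0.9
sorted_probs, sorted_idx = probs.sort(descending=True)
cutoff = int((sorted_probs.cumsum(dim=-1) >= p).nonzero()[0]) + 1
nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()

# Sample the next token id from the nucleus distribution
next_token = sorted_idx[torch.multinomial(nucleus_probs, 1)]
print(next_token.item())
```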

Advice

A collection of quotes, advice, and tips that I've found helpful in my learning journey.

| Category | Advice | Source |
| --- | --- | --- |
| Study | Study like there is nothing else to do in your life. Create a plan and stick to it no matter what. No change of directions and no second thoughts. | DL Insider |
| Study | Join Discord communities where the latest (state-of-the-art) papers and models are discussed. | Vikas Paruchuri |
| Study | Despite transformers, CNNs are still widely used, and everything old is new again with RNNs. | Vikas Paruchuri |
| Study | Learn from examples and create things along the path. | DL Insider |
| Study | It can take years of hard study to master ML/DL math. And in the end it will help you only in 15% of the cases ... or less. | DL Insider |
| Study | It is much easier to understand the models from an engineering perspective and then fill the gaps with math. | DL Insider |
| Study | It is much easier to learn ML as an SWE than the other way around. | Greg Brockman |
| Coding | You should get to a point where you can code up any of the main neural network architectures in plain numpy (forward and backward passes). | Vikas Paruchuri |
| Training LLMs | The easiest entry point for training models these days is fine-tuning a base model. Huggingface transformers is great for fine-tuning because it implements a lot of models already, and uses PyTorch. | Vikas Paruchuri |
| Training LLMs | The easiest way to fine-tune is to pick a small model (7B or fewer params) and try fine-tuning with LoRA. | Vikas Paruchuri |
| Training LLMs | Understanding the fundamentals is important to training good models. | Vikas Paruchuri |
| Training LLMs | You don't need a lot of GPUs for fine-tuning. | Vikas Paruchuri |
| Impact | Fine-tuning is a very crowded space, and it's hard to make an impact when the state of the art changes every day. | Vikas Paruchuri |
| Impact | Finding interesting problems to solve is the best way to make an impact with what you build. | Vikas Paruchuri |
| Impact | There are many niches in AI where you can make a big impact, even as a relative outsider. | Vikas Paruchuri |
| Impact | Focus on the system you need, not the one you like. You will have to be able to use many different resources (Google, HuggingFace, OpenAI, etc.). | |

Reading List

In this section I keep track of all the articles, papers, and tutorials I am reading to learn about LLMs.

Next Up:

Inbox:

Archive

DeepMind's "Chinchilla" paper presents several key results and findings centered around the scaling laws in language model training:

  • The paper argues that current models (in 2022) are significantly under-trained (meaning they have too many parameters for how long they've been trained, assuming the datasets were of high quality).
  • The paper discusses how to estimate the optimal model size and number of tokens for training dense autoregressive models.
  • In the paper these ideas were tested by training a 70B-parameter model, "Chinchilla", and comparing its performance on a range of natural language reasoning and understanding tasks, where it demonstrated better performance than larger models that were under-trained for their size.
  • The authors suggest scaling the number of training tokens approximately linearly with model size: for compute-optimal training, every doubling of the model size should be matched by a doubling of the number of training tokens.
  • The paper presents three techniques for determining the number of tokens or the size of the model dependent on the amount of (fixed) available compute (FLOPS):
  1. Fixed model size -> scale the number of training tokens with compute.
  2. Fixed number of training tokens -> scale the model size with compute.
  3. Fixed compute -> scale the number of tokens with size of the model.

Key Takeaways:

  • Increasing the amount of training data is more efficient than increasing model size when both are constrained by compute resources.
  • It's more effective to train a slightly smaller model on more data than a larger model on less data:
  1. Smaller models are more cost-effective during training
  2. Smaller models are cheaper to run during inference
  3. The resulting model performs better on a range of tasks.
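
A back-of-the-envelope sketch of the scaling trade-off described above. It uses the common approximation that training compute C ≈ 6·N·D (N = parameters, D = training tokens); the numbers are only illustrative, not taken from the paper's tables:

```python
# Back-of-the-envelope scaling: training compute C ≈ 6 * N * D,
# where N = number of parameters and D = number of training tokens.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Roughly the reported Chinchilla configuration: 70B parameters, 1.4T tokens
budget = training_flops(70e9, 1.4e12)
print(f"compute budget: {budget:.2e} FLOPs")

# Spending the same budget on a 4x larger (280B) model leaves only ~0.35T tokens,
# i.e. a larger but under-trained model for the same compute.
print(f"tokens for a 280B model: {budget / (6 * 280e9):.2e}")
```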

Resources

Free ML Training Resources:

Discord Servers: