
Study resources, code, and personal notes on the theory and application of Large Language Models


LLM-FastTrack

This is where I'm keeping track of everything I learn about Large Language Models. It's straightforward – notes, code, and links to useful resources.

What's Inside

  • Notes: Quick thoughts, summaries, and explanations I've written down to better understand LLM concepts.
  • Code: The actual code I've written while experimenting with LLMs. It's not always pretty, but it works (mostly).
  • Resources: Links to articles, papers, and tutorials that have cleared things up for me. No fluff, just the good stuff.

My Notebooks

This is where I keep my experiments and code snippets.

Why This Repo

I needed somewhere to dump my brain as I dive into LLMs. Maybe it'll help someone else, maybe not. But it's helping me keep track of my progress and organize my thoughts.

Feel free to look around if you're into LLMs or just curious about what I'm learning. No promises, but you might find something useful.


Study Plan

This is the curriculum I'm following to learn about Large Language Models. It's a mix of PyTorch basics, LLM concepts, and real-world applications. The first draft of the study plan was generated by an LLM, and I'll be updating it as I go along.

Resources:

| Category | Title + Link | Comment |
| --- | --- | --- |
| Study | How I got into deep learning | Vikas Paruchuri's journey into deep learning and AI. |

1. Getting Good with PyTorch


My Study Notes

Most of my notes will be in the form of notebooks, and I will link them in each section. I will also write a short summary of the key points I've learned in each section.

Before getting started

At the moment I prefer to use PyCharm Pro as my dev environment. The benefits are venv and notebook support plus full IDE support (with Copilot). If you want to run any of my code, you need to set up and activate a virtual environment and install the required packages with:

pip install -r requirements.txt

Alternatively, follow these installation guides:

PyTorch

I am a software engineer and already know how to code, but I am new to the PyTorch library and want to get familiar and fluent writing code with it before I dive deeper into LLMs. If you don't know how to program, I would recommend taking at least a short introductory course in Python before continuing.

If you look at the tools and libraries used to build neural networks, you'll quickly discover that there are many choices. You will also see that PyTorch is one of the most popular and fastest-growing libraries, so that is the library I picked as a starting point. For now I am not going to worry about the other choices or whether I need to know them; I'll focus on PyTorch and expand later when I need to.

Vikas Paruchuri said this about proficiency: "You should get to a point where you can code up any of the main neural networks architectures in plain numpy". Since PyTorch tensors are very similar to NumPy arrays, this will be my goal. Now let's get good with PyTorch tensors.

1.1. PyTorch Basics

1.1.1. PyTorch Tensors vs Numpy (Arrays)

I did a bit of searching to find out how PyTorch tensors and NumPy arrays differ and how they are similar. Here is what I found:

PyTorch tensors and NumPy arrays are both powerful tools widely used in the field of data science and machine learning, especially for array computing and handling large datasets. Despite their similarities, there are fundamental differences, especially in how they are used within the deep learning context.

Similarities between PyTorch Tensors and NumPy Arrays

  1. Data Structure: Both PyTorch tensors and NumPy arrays provide efficient data structures for storing and manipulating numerical data in multi-dimensional arrays. They offer a wide range of functionalities for array manipulations such as reshaping, slicing, and broadcasting.

  2. API Overlap and Interoperability: There is a significant overlap in the APIs between PyTorch and NumPy, making it relatively easy for users to switch between the two or to integrate them within the same project. PyTorch tensors can be easily converted to and from NumPy arrays, allowing for seamless integration between the two libraries. Functions for operations like addition, multiplication, transposition, and more, have similar calling conventions.

  3. Memory Sharing: PyTorch can interoperate with NumPy through memory sharing. Tensors can be converted to NumPy arrays and vice versa without necessarily copying data. This allows for efficient memory usage when transitioning between the two during preprocessing or analysis stages.
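
To make the memory-sharing point concrete, here is a minimal sketch (it assumes torch and numpy are installed) showing that `torch.from_numpy` and `Tensor.numpy()` share the same underlying buffer:

```python
import numpy as np
import torch

# NumPy array -> PyTorch tensor (shares memory, no copy)
a = np.ones(3)
t = torch.from_numpy(a)

# Modifying the tensor in place is visible in the NumPy array
t.add_(1)
print(a)          # [2. 2. 2.]

# A CPU tensor's .numpy() view also shares memory
b = t.numpy()
a[0] = 5
print(b)          # [5. 2. 2.]
```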

Differences between PyTorch Tensors and NumPy Arrays

  1. Computation Graphs and Backpropagation: PyTorch tensors are integrated with a powerful automatic differentiation library, Autograd. This makes them suitable for building neural networks where gradients are computed for optimization. NumPy, on the other hand, does not support automatic differentiation and is typically used for more straightforward numerical computations without the need for tracking gradients. (See the sketch after this list for a minimal example.)

  2. GPU Support: PyTorch tensors are designed to easily switch between CPU and GPU operations, which is crucial for training deep learning models efficiently. NumPy primarily operates on the CPU, meaning operations using NumPy arrays do not benefit from GPU acceleration.

  3. In-place Operations: PyTorch offers explicit in-place operations (tensor methods ending in an underscore, such as add_) that modify a tensor's underlying data without creating a new tensor. NumPy operations typically return a new array unless an in-place form (such as the out= argument or augmented assignment) is used.

  4. Designed for Deep Learning: PyTorch is inherently designed for deep learning applications. It provides functionalities like tensor operations on GPUs, distributed computing, and more, which are specifically tailored for training neural networks. NumPy, while versatile in handling numerical data, lacks these deep learning-specific enhancements.

  5. Dynamic vs Static Computing: PyTorch allows for dynamic computational graphs, meaning the graph is built at runtime. This is beneficial for models where the computation cannot be completely described as a static graph beforehand. NumPy’s usage scenario doesn’t involve computational graphs and is purely for static array computations.
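
Here is the minimal sketch mentioned above, covering the first two differences (autograd and GPU support). It assumes torch is installed, and the GPU move is guarded so it also runs on CPU-only machines:

```python
import torch

# Autograd: track operations on a tensor and compute gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()          # y = x1^2 + x2^2
y.backward()                # dy/dx = 2x
print(x.grad)               # tensor([4., 6.])

# GPU support: move a tensor to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
z = torch.randn(2, 2).to(device)
print(z.device)
```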

Use Cases

NumPy is excellent for tasks that require straightforward numerical computation in science and engineering but do not need gradients or massive parallelism offered by GPUs. PyTorch is preferable when developing complex models that require gradients, need to run on GPUs for performance, or when the models involve dynamic changes in the computation process.

Summary:

While PyTorch tensors and NumPy arrays share many similarities in terms of their core functionality as n-dimensional arrays, PyTorch tensors are specifically designed for deep learning and machine learning applications, with features like automatic differentiation and GPU support, which make them more suitable for these tasks compared to the more general-purpose NumPy arrays.

Conclusion:

Since we are going to get good with LLMs, PyTorch sounds just like what we need. Let's get into it in the next section.

PyTorch Tensors

I created a Jupyter Notebook 001-pytorch-tensors.ipynb that contains all of my basic experiments with PyTorch tensors.

Study Notes

I like to keep my notes in a question answering format because it helps with retrieval and interview preparation at the same time.

| Question | Answer |
| --- | --- |
| What is a tensor? | A tensor is a multi-dimensional array used for numerical computations. (It is very similar to a NumPy array or a TensorFlow tensor.) |
| What is a tensor with rank 0? | A 0-dimensional tensor is a scalar that represents a single numerical value. |
| What is a tensor with rank 1? | A 1-dimensional tensor is a vector that represents a list of numerical values. |
| What is a tensor with rank 2? | A 2-dimensional tensor is a matrix that represents a table of numerical values. |
| What is broadcasting? | Broadcasting is a technique in PyTorch that allows element-wise operations between tensors of different shapes and sizes, without manually reshaping or duplicating data. |
| When is a PyTorch tensor "broadcastable"? | Rule 1: Each tensor has at least one dimension. Rule 2: When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist. |
| Why does the choice of data type for a tensor matter? | Choosing the right one is important because it influences memory usage and performance. |
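
A small sketch of the broadcasting rules from the table above (assuming torch is installed; the shapes are chosen only to illustrate the rules):

```python
import torch

# Shapes (3, 1) and (2,): trailing dims are 1 vs 2 (one is 1 -> OK),
# next dim is 3 vs missing (does not exist -> OK), so they broadcast to (3, 2)
a = torch.arange(3).reshape(3, 1)   # shape (3, 1)
b = torch.tensor([10, 20])          # shape (2,)
print((a + b).shape)                # torch.Size([3, 2])

# Shapes (3, 2) and (4, 2) violate rule 2 (3 vs 4, neither is 1) -> error
try:
    torch.zeros(3, 2) + torch.zeros(4, 2)
except RuntimeError as e:
    print("not broadcastable:", e)
```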

Study Resources

| Category | Title | Comment |
| --- | --- | --- |
| Coding | 4.3 Vectors, Matrices, and Broadcasting | A YouTube video by Sebastian Raschka |
| Coding | Broadcasting Semantics in PyTorch | Explains how/when broadcasting happens |
| Coding | PyTorch Tensor Basics | Basic tensor operations in PyTorch with explanations |

II. Learning about LLMs

  • Transformer models
  • Architecture of Transformer Models: Attention mechanisms, multi-head attention, positional encoding, feed-forward networks.
  • Pre-trained Models Overview: GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and variants (RoBERTa, T5, etc.).
  • Tokenization and Embeddings: WordPiece, SentencePiece, BPE (Byte Pair Encoding), contextual embeddings.
  • Language Modeling: Unsupervised learning, predicting the next word, understanding context.
  • Evaluation Metrics: Perplexity, BLEU score, ROUGE, F1 score, accuracy, precision, recall.
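
Since perplexity will come up again and again in this plan, here is a minimal sketch of how it is computed from token logits. The logits and targets below are made up for illustration; with a real model they would come from the model's output and the reference text:

```python
import torch
import torch.nn.functional as F

# Made-up logits for a 5-token sequence over a 10-word vocabulary
logits = torch.randn(5, 10)
targets = torch.randint(0, 10, (5,))

# Perplexity = exp(average negative log-likelihood of the target tokens)
nll = F.cross_entropy(logits, targets)   # mean NLL over the sequence
perplexity = torch.exp(nll)
print(perplexity.item())
```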

Basics of Transformers

Q: What are homogenized models, and why are transformers homogenized models?
A: Homogenized models are designed to be highly adaptable across a wide range of tasks without needing specific task-oriented tuning. Transformers are considered homogenized models because they use the same model architecture to perform various NLP tasks effectively, leveraging their ability to process sequences of data in parallel and understand context without task-specific adjustments.

Q: What are foundation models in the context of transformers?
A: A foundation model is a model that has been trained on billions of records and has billions of parameters. These models can then perform a wide range of tasks without any further fine-tuning.

Q: When should one use traditional NLP methods, and when are transformers the better choice for NLP tasks?
A: Traditional NLP methods are useful when working with smaller datasets or when computational resources are limited. They are also beneficial when the tasks require simpler models that can be more easily interpreted. Transformers are better when dealing with large datasets, require understanding of context, or when the tasks benefit from deeper, more complex patterns in the data.

Q: What are transformer models in the context of Industry 4.0?
A: In Industry 4.0, transformer models are used for automating complex decision-making processes by analyzing vast amounts of data from various sources such as sensors, machines, and production lines. They enhance predictive maintenance, quality control, and supply chain management through advanced NLP and machine learning techniques.

Q: Why do we say that a feature of transformers is a high-level of homogenization?
A: Transformers exhibit a high level of homogenization because they apply the same architecture to process various types of data across multiple tasks, enabling consistent performance and facilitating machine-to-machine connections in dynamic environments like Industry 4.0.

Q: What are some examples of foundation models?
A: Examples of foundation models include GPT-3 by OpenAI, Google's BERT, Facebook’s RoBERTa, and Microsoft’s Turing-NLG.

Q: Why can it be that some models do not reach the homogenization level of foundation models?
A: Some models may not achieve the homogenization level of foundation models due to limitations in training data diversity, computational resources, or insufficient training methodologies that prevent the models from generalizing well across different tasks.

Q: What is a stochastic model, and how does that relate to LLMs?
A: A stochastic model in the context of LLMs (large language models) like Codex refers to their probabilistic nature in generating outputs. This means they use randomness in their processes to generate varied results, which can be useful for tasks like code generation where multiple correct solutions can exist.

Q: What is a sequence model?
A: A sequence model is a type of AI model that processes sequences of data, such as sentences or time series, where the order of the input data is important. It learns to predict elements in the sequence, understand context, or generate new sequences based on learned patterns.

Q: In the context of NLP, what are Markov Chains and Markov (decision) processes, what are they used for?
A: In NLP, Markov Chains are used to model the probabilities of sequences of words or phrases, assuming that the probability of each item depends only on the previous item. Markov decision processes extend this concept into decision making, where transitions between states are decided not only based on the state but also the action taken, useful in conversational agents and other sequential decision-making tasks.

Q: What are RNNs good for or used for? Give examples.
A: RNNs (Recurrent Neural Networks) are particularly good for tasks where the order and context of the input data matter, such as text generation, speech recognition, and time series prediction. They excel in handling sequences where the current input depends on the previous one.

Q: Can CNNs be applied to text? How?
A: Yes, CNNs (Convolutional Neural Networks) can be applied to text by treating segments of words or characters as spatial dimensions, similar to how they treat regions in an image. This allows them to identify patterns like word groupings and sentence fragments, useful in tasks like sentiment analysis and topic classification.

Q: What is LeNet-5 from Yann LeCun, and why is it well known?
A: LeNet-5, developed by Yann LeCun, is one of the earliest convolutional neural networks that significantly influenced the development of deep learning. It was initially designed for digit recognition and is well-known for demonstrating the effectiveness of CNNs in practical applications, leading to the broader adoption of deep learning in many fields.

Q: Why can't CNNs deal well with long term dependencies in long and/or complex sequences of text?
A: CNNs struggle with long-term dependencies in text because their convolutional filters typically capture local patterns within a fixed window size, making it difficult to maintain contextual information over longer text sequences without extensive layering or large receptive fields, which can be computationally inefficient.

Q: Are there recurrences in transformer models? A: No, recurrence has been abandoned in the transformer architecture.

Q: What type of architecture is a transformer? A: An encoder-decoder architecture.

Q: What is the encoder, and what is the decoder responsible for in a transformer? A: The encoder is responsible for creating a rich context representation of the input text, and the decoder is responsible for using that to generate the next output based on the previous outputs.

Q: What replaced the recurrence functions of RNNs, LSTMs, CNNs in transformers? A: The attention mechanism has replaced the recurrence functions.

Q: How many layers did the encoder stack of the original transformer have (from the "Attention Is All You Need" paper)? A: The encoder consisted of 6 layers, each featuring a multi-head attention sublayer and a feed-forward sublayer, each followed by a normalization (add & norm) step.

Q: What is a difference between the encoder and decoder stack? A: The decoder stack features an additional masked multi head attention sublayer.

Q: What is multi-head attention? A: Transformers have multiple attention heads (8 in the original architecture) that can process the input in parallel.

Q: What do the attention mechanisms learn? A: Each attention mechanism learns different perspectives of the same input sequence.

Q: With what has recurrence been replaced in transformer models? A: The recurrence we know from RNNs and LSTMs has been replaced by the attention mechanism in transformers.

Q: Are the layers in the encoder stack of the original transformer identical? A: Yes, each layer consists of the same sublayers (a multi-head attention sublayer and a feed-forward sublayer); the input embedding combined with positional encoding is applied before the input enters the first layer.

Q: Why do we speak of self-attention when we talk about transformers? A: Because the queries, keys, and values are all derived from the same input sequence: each token attends to the other tokens of that same sequence.

Q: What is the motivation for the architecture of the transformer model? A: To allow an industrial approach to deep learning. For a start, it fits hardware optimization requirements well. For example, the stack structure of transformers allows for the design of domain-specific optimized hardware that requires less floating-point precision.

Q: What is a stack in the context of transformer architectures? A: A stack consists of n identical layers. A stack can either be an encoder or a decoder. A stack runs from the bottom (layer 1) to the top (layer n), and along the way each layer learns something that it passes on to the next layer, similar to how human memory works.

Q: What are sublayers? A: Each layer in a stack contains sublayers. The structure of the sublayers is the same across layers (great for hardware optimization). In the original transformer paper, the sublayers were a self-attention sublayer and a feedforward network sublayer, processed in that order. The self-attention sublayer was specifically designed for NLP and hardware optimizations.

Q: What are attention heads? A: Each self-attention sublayer is divided into n independent and identical layers called "heads". The original transformer architecture contained 8 heads in the self-attention sublayer of every layer. Each of the heads can be processed independently of each other, ideal for parallelization.
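
To make the attention questions above more concrete, here is a minimal sketch of scaled dot-product attention as described in the "Attention Is All You Need" paper (single head, no masking, no learned projection matrices; dimensions are arbitrary):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Self-attention: Q, K, V all come from the same input sequence
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)
q, k, v = x, x, x                 # identity "projections" for illustration
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                  # torch.Size([4, 8])
```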

Q: What is an autoregressive language model? A: An autoregressive language model is a type of artificial intelligence model that generates text by predicting one word at a time, based on the previous words in the sequence. This approach is called "autoregressive" because it uses its own previous outputs as inputs for future predictions.

Q: What is an example of an autoregressive language model? A: GPT-3 (Generative Pre-trained Transformer 3) by OpenAI is an example of an autoregressive language model that generates human-like text by predicting the next word in a sequence based on the context of the previous words.
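
A minimal sketch of what "using its own previous outputs as inputs" looks like in code: a greedy decoding loop. It assumes the Hugging Face transformers library is installed and uses the small gpt2 checkpoint, which is downloaded on first run:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits                # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # feed the output back in as input

print(tokenizer.decode(ids[0]))
```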

Q: What is the difference between autoregressive and non-autoregressive language models? A: Autoregressive models generate text one token at a time, conditioning each prediction on the tokens generated so far. Non-autoregressive models predict all (or several) output tokens in parallel, which is faster at inference time but typically gives up some output quality.


Q: What is the MMLU Benchmark? A: The MMLU Benchmark (Massive Multitask Language Understanding) is a challenging test designed to measure a text model's multitask accuracy by evaluating models in zero-shot and few-shot settings. It serves as a standardized way to assess AI performance on tasks that range from simple math to complex legal reasoning. MMLU contains 57 tasks across topics including elementary mathematics, US history, computer science, and law, requiring models to demonstrate a broad knowledge base and problem-solving skills.

III. Mathematical Foundations

Foundational and advanced mathematical concepts that underpin the workings of Large Language Models (LLMs), especially those based on the Transformer architecture.

Chapter Overview

  1. Linear Algebra:
    • Vectors and Matrices: Understanding the basic building blocks of neural networks, including operations like addition, multiplication, and transformation.
    • Eigenvalues and Eigenvectors: Importance in understanding how neural networks learn and how data can be transformed.
    • Special Matrices: Identity matrices, diagonal matrices, and their properties relevant to neural network optimizations.

3.1. Linear Algebra

Study Notes

This is an overview and review of some of the basic concepts in linear algebra.

3.1.1. The Geometry of Linear Equations

Idea: We are looking for a solution to a system of linear equations. For that, we express the equations as row vectors in a matrix A. The solution to all equations is then the vector x in Ax = b.

There are different ways of looking at the matrix and the vectors involved in the Ax = b equation:

Row picture of a matrix

Linear combinations of column vectors

Column picture of a matrix

Matrix-vector multiplication

You can do it column-based: take 1 of the first column and add 2 of the second column.

You can do it row-based (dot products): the first entry of the result is the dot product of the first row of A with the vector, and the second entry is the dot product of the second row of A with the vector.
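
A small NumPy sketch of the two views of the same multiplication (the matrix and vector are made up; x = [1, 2] matches the "1 of the first column plus 2 of the second column" example above):

```python
import numpy as np

A = np.array([[2., 5.],
              [1., 3.]])
x = np.array([1., 2.])

# Column view: x1 * (first column) + x2 * (second column)
col_view = 1 * A[:, 0] + 2 * A[:, 1]

# Row view: dot product of each row of A with x
row_view = np.array([A[0] @ x, A[1] @ x])

print(col_view, row_view, A @ x)   # all three are the same vector
```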

| Question | Answer |
| --- | --- |
| Do the linear combinations of the columns fill n-dimensional space? | This is the same question as: does Ax = b always have a solution for x? |
| Is there always a solution for x in Ax = b? | Yes, if A is invertible (equivalently, if A is non-singular). |
| Are invertible matrices always non-singular matrices? | Yes. |
| What is the definition of a singular matrix? | A matrix is singular if it does not have an inverse. |
| What can you tell about a matrix if its determinant is zero? | The matrix has linearly dependent row or column vectors. |
| When is a matrix not invertible? | When it has linearly dependent row or column vectors. |
| What does the determinant tell us about a matrix? | When it is zero, the matrix is not invertible. When it is non-zero, the row and column vectors are linearly independent. |
| What is the definition of an invertible matrix? | A is invertible if A^-1^ exists such that A*A^-1^ = I. |
| What are some methods that can be used to find the inverse of a matrix? | a) Gaussian elimination (row reduction); b) matrix decomposition techniques: LU decomposition, QR decomposition, singular value decomposition (SVD). |
| When a matrix is invertible, how many solutions can exist for x in Ax = b? | x always has exactly one solution. |
| When a matrix is singular, how many solutions can exist for x in Ax = b? | x can have 0 or infinitely many solutions, but never exactly one. |
| How can Gaussian elimination fail? | It can fail primarily due to zero pivots that cannot be removed by row swaps. This often occurs when there is linear dependence among the rows, leading either to no solution (inconsistent system) or to infinitely many solutions (underdetermined system). |
| What does it mean when we find a zero pivot during Gaussian elimination (and no row exchange can fix it)? | That we have linearly dependent rows or columns, meaning there are either zero or infinitely many solutions to the system of equations. |
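
A short sketch tying the table above together: checking the determinant and solving Ax = b with NumPy (the matrices are chosen arbitrarily for illustration):

```python
import numpy as np

A = np.array([[2., 5.],
              [1., 3.]])
b = np.array([12., 7.])

print(np.linalg.det(A))        # 1.0 -> non-zero, so A is invertible
print(np.linalg.solve(A, b))   # exactly one solution: [1. 2.]

# A singular matrix (second row is a multiple of the first)
S = np.array([[1., 2.],
              [2., 4.]])
print(np.linalg.det(S))        # 0.0 -> no inverse; Sx = b has 0 or infinitely many solutions
```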

Study Resources

| Category | Title | Comment |
| --- | --- | --- |
| Math | MIT OCW Linear Algebra, Fall 2011 | by Prof. Gilbert Strang |
| Math | MIT OCW Lecture 1: The Geometry of Linear Equations | by Prof. Gilbert Strang |
| Math | MIT OCW Lecture 2: Elimination with Matrices | by Prof. Gilbert Strang |
| Math | MIT OCW Lecture 3: Multiplication and Inverse Matrices | by Prof. Gilbert Strang |

IV. Fine-Tuning and Optimising LLMs

  • Fine-Tuning Techniques: Transfer learning, learning rate adjustment, layer freezing/unfreezing, gradual unfreezing.
  • Optimization Algorithms: Adam, RMSprop, SGD, learning rate schedulers.
  • Regularization and Generalization: Dropout, weight decay, batch normalization, early stopping.
  • Efficiency and Scalability: Mixed precision training, model parallelism, data parallelism, distributed training.
  • Model Size Reduction: Quantization, pruning, knowledge distillation.

V. RAG: Retrieval-Augmented Generation

  • Introduction to RAG: Concept, architecture, comparison with traditional LLMs.
  • Retrieval Mechanisms: Dense Vector Retrieval, BM25, using external knowledge bases.
  • Integrating RAG with LLMs: Fine-tuning RAG models, customizing retrieval components.
  • Applications of RAG: Question answering, fact checking, content generation with external references.
  • Challenges and Solutions: Handling out-of-date knowledge, bias in retrieved documents, improving retrieval relevance.

VI. Developing real-world Applications with LLMs

  • Integrating LLMs into Applications: API development, deploying models with Flask/Django for web applications, mobile app integration.
  • User Interface and Experience: Chatbots, virtual assistants, generating human-like text, handling user inputs.
  • Security and Scalability: Authentication, authorization, load balancing, caching.
  • Monitoring and Maintenance: Logging, error handling, continuous integration and deployment (CI/CD) pipelines.
  • Case Studies and Project Ideas: Content generation, summarization, translation, sentiment analysis, automated customer service.

Terms and Concepts (uncategorized)

| Keyword | Explanation | Links |
| --- | --- | --- |
| Temperature | Affects the randomness of the model's output by scaling the logits before applying softmax, influencing the model's "creativity" or certainty in its predictions. Lower temperatures lead to more deterministic outputs, while higher temperatures increase diversity and creativity. | Peter Chng |
| Top P (Nucleus Sampling) | Selects a subset of likely outcomes by ensuring the cumulative probability exceeds a threshold p, allowing for adaptive and context-sensitive text generation. This method focuses on covering a certain amount of probability mass. | Peter Chng |
| Top K | Limits the selection pool to the K most probable next words, reducing randomness by excluding less likely predictions from consideration. This method normalizes the probabilities of the top K tokens to sample the next token. | Peter Chng |
| Q (Query) | Represents the input tokens being compared against key-value pairs in attention mechanisms, facilitating the model's focus on different parts of the input sequence for predictions. | |
| K (Key) | Represents the tokens used to compute the amount of attention that input tokens should pay to the corresponding values, crucial for determining focus areas in the model's attention mechanism. | |
| V (Value) | Is the content that is being attended to, enriched through the attention mechanism with information from the key, indicating the actual information the model focuses on during processing. | |
| Embeddings | High-dimensional representations of tokens that capture semantic meanings, allowing models to process words or tokens by encapsulating both syntactic and semantic information. | |
| Tokenizers | Tools that segment text into manageable pieces for processing by models, with different algorithms affecting model performance and output quality. | |
| Rankers | Algorithms used to order documents or predict their relevance to a query, influencing the selection of next words or sentences based on certain criteria in NLP applications. | |
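
To connect the first three terms in the table, here is a minimal sketch of temperature, top-k, and top-p applied to a single vector of made-up next-token logits (plain PyTorch, no real model):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])   # made-up next-token logits

# Temperature: scale logits before softmax (lower -> more deterministic output)
probs = torch.softmax(logits / 0.7, dim=-1)

# Top-k: keep only the k most probable tokens and renormalize
k = 3
topk_vals, topk_idx = probs.topk(k)
topk_probs = topk_vals / topk_vals.sum()

# Top-p (nucleus): keep the smallest set of tokens whose cumulative probability >= p
p = 0.9
sorted_probs, sorted_idx = probs.sort(descending=True)
cutoff = int((sorted_probs.cumsum(dim=-1) >= p).nonzero()[0]) + 1
nucleus_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()

# Sample the next token id from the nucleus distribution
next_token = sorted_idx[torch.multinomial(nucleus_probs, 1)]
print(next_token.item())
```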

Advice

A collection of quotes, advice, and tips that I've found helpful in my learning journey.

| Category | Advice | Source |
| --- | --- | --- |
| Study | Study like there is nothing else to do in your life. Create a plan and stick to it no matter what. No change of directions and no second thoughts. | DL Insider |
| Study | Join Discord communities where the latest (state-of-the-art) papers and models are discussed. | Vikas Paruchuri |
| Study | Despite transformers, CNNs are still widely used, and everything old is new again with RNNs. | Vikas Paruchuri |
| Study | Learn from examples and create things along the path. | DL Insider |
| Study | It can take years of hard study to master ML/DL math. And in the end it will help you only in 15% of the cases ... or less. | DL Insider |
| Study | It is much easier to understand the models from an engineering perspective and then fill the gaps with math. | DL Insider |
| Study | It is much easier to learn ML as an SWE than the other way around. | Greg Brockman |
| Coding | You should get to a point where you can code up any of the main neural network architectures in plain numpy (forward and backward passes). | Vikas Paruchuri |
| Training LLMs | The easiest entry point for training models these days is fine-tuning a base model. Huggingface transformers is great for fine-tuning because it implements a lot of models already, and uses PyTorch. | Vikas Paruchuri |
| Training LLMs | The easiest way to fine-tune is to pick a small model (7B or fewer params) and try fine-tuning with LoRA. | Vikas Paruchuri |
| Training LLMs | Understanding the fundamentals is important to training good models. | Vikas Paruchuri |
| Training LLMs | You don't need a lot of GPUs for fine-tuning. | Vikas Paruchuri |
| Impact | Fine-tuning is a very crowded space, and it's hard to make an impact when the state of the art changes every day. | Vikas Paruchuri |
| Impact | Finding interesting problems to solve is the best way to make an impact with what you build. | Vikas Paruchuri |
| Impact | There are many niches in AI where you can make a big impact, even as a relative outsider. | Vikas Paruchuri |
| Impact | Focus on the system you need, not the one you like. You will have to be able to use many different resources (Google, HuggingFace, OpenAI, etc.). | |

Reading List

In this section I keep track of all the articles, papers, and tutorials I am reading to learn about LLMs.

Next Up:

Inbox:

Archive

DeepMind's "Chinchilla" paper presents several key results and findings centered around the scaling laws in language model training:

  • The paper argues that current models (in 2022) are significantly under-trained (meaning they have too many parameters for how long they've been trained, assuming the datasets were of high quality).
  • The paper discusses how to estimate the optimal model size and number of tokens for training dense autoregressive models.
  • In the paper these ideas were tested by training a 70B-parameter model, "Chinchilla", and comparing its performance on a range of natural language reasoning and understanding tasks, where it demonstrated better performance than larger models that were under-trained for their size.
  • The authors suggest scaling the number of training tokens approximately linearly with model size: for compute-optimal training, every doubling of the model size should be matched by a doubling of the number of training tokens.
  • The paper presents three techniques for determining the number of tokens or the size of the model dependent on the amount of (fixed) available compute (FLOPS):
  1. Fixed model size -> scale the number of training tokens with compute.
  2. Fixed number of training tokens -> scale the model size with compute.
  3. Fixed compute -> scale the number of tokens with size of the model.

Key Takeaways:

  • Increasing the amount of training data is more efficient than increasing model size when both are constrained by compute resources.
  • It's more effective to train a slightly smaller model on more data than a larger model on less data:
  1. Smaller models are more cost-effective during training
  2. Smaller models are cheaper to run during inference
  3. The resulting model performs better on a range of tasks.
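
A back-of-the-envelope sketch of the scaling trade-off described above. It uses the common approximation that training compute C ≈ 6·N·D (N = parameters, D = training tokens); the numbers are only illustrative, not taken from the paper's tables:

```python
# Back-of-the-envelope scaling: training compute C ≈ 6 * N * D,
# where N = number of parameters and D = number of training tokens.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

# Roughly the reported Chinchilla configuration: 70B parameters, 1.4T tokens
budget = training_flops(70e9, 1.4e12)
print(f"compute budget: {budget:.2e} FLOPs")

# Spending the same budget on a 4x larger (280B) model leaves only ~0.35T tokens,
# i.e. a larger but under-trained model for the same compute.
print(f"tokens for a 280B model: {budget / (6 * 280e9):.2e}")
```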

Resources

Free ML Training Resources:

Discord Servers: