BERT
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based encoder model pre-trained with masked language modeling and next-sentence prediction; it learns bidirectional representations of text that are then fine-tuned for understanding tasks such as classification and question answering.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
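A minimal sketch of BERT's masked-token prediction, assuming the Hugging Face `transformers` package and the public `bert-base-uncased` checkpoint:

```python
# BERT fills in a masked token using both left and right context.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```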
BLAS
BLAS (Basic Linear Algebra Subprograms) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication.
Basic Linear Algebra Subprograms
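A minimal sketch of calling a level-3 BLAS routine (DGEMM, general matrix multiply) through SciPy's low-level wrappers, assuming NumPy and SciPy are installed:

```python
import numpy as np
from scipy.linalg.blas import dgemm

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

# DGEMM computes alpha * a @ b (+ beta * c); here simply 1.0 * a @ b.
c = dgemm(alpha=1.0, a=a, b=b)
print(c)  # same result as a @ b
```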
CLIP
CLIP (Contrastive Language-Image Pre-training) jointly trains an image encoder and a text encoder with a contrastive objective so that matching image-text pairs have similar embeddings; the shared embedding space enables zero-shot image classification from natural-language labels.
CLIP: Connecting text and images
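A minimal sketch of zero-shot classification with CLIP, assuming the Hugging Face `transformers` package, Pillow, and a local image file `photo.jpg` (a hypothetical path):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels, images=Image.open("photo.jpg"),
                   return_tensors="pt", padding=True)

# Image-text similarity scores act as zero-shot classification logits.
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```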
DPO
Direct Preference Optimization (DPO) fine-tunes a language model directly on human preference pairs (a chosen and a rejected response per prompt), using a simple classification-style loss against a frozen reference model; it achieves RLHF-style alignment without training a separate reward model.
Direct Preference Optimization
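A minimal sketch of the DPO loss computed from summed token log-probabilities, assuming PyTorch; the tensors are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratio of the policy vs. the frozen reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Widen the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-14.8]))
print(loss)
```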
DSPy
DSPy (Demonstrate-Search-Predict) is a framework for programming language model pipelines declaratively: programs are built from composable modules with typed signatures, and optimizers tune the prompts and weights rather than relying on hand-written prompt strings.
DSPy: Programming—not prompting—Foundation Models
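A minimal sketch of declaring a DSPy module instead of writing a prompt, assuming the `dspy` package and a configured API key; the model name is illustrative:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A signature declares inputs and outputs; DSPy compiles the prompting.
qa = dspy.ChainOfThought("question -> answer")
print(qa(question="What does BLAS stand for?").answer)
```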
GGUF
GPT-Generated Unified Format (GGUF) is a binary file format that packages model weights and metadata in a single file for inference with GGML-based runtimes such as llama.cpp; it supersedes the earlier GGML format.
GGML vs GGUF
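A minimal sketch of loading a GGUF model, assuming the `llama-cpp-python` package and a local GGUF file (the path is hypothetical):

```python
from llama_cpp import Llama

llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf")

out = llm("Q: What is GGUF? A:", max_tokens=64)
print(out["choices"][0]["text"])
```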
GGML
GGML (Georgi Gerganov's Machine Learning) is a C tensor library for on-device inference whose name also referred to an early file format for storing model weights; that format has since been superseded by GGUF.
GGML vs GGUF
GLU
GLU (Gated Linear Unit) is a gating mechanism that splits its input into two halves and multiplies one half element-wise by the sigmoid of the other, GLU(a, b) = a ⊗ σ(b), letting the network learn which components to pass through.
Language Modeling with Gated Convolutional Networks
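A minimal sketch in PyTorch showing that gating by hand matches the built-in `F.glu`:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)            # batch of 4, feature dim 8
a, b = x.chunk(2, dim=-1)        # split into the two halves
manual = a * torch.sigmoid(b)    # GLU(a, b) = a * sigmoid(b)

assert torch.allclose(manual, F.glu(x, dim=-1))
```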
FSDP
Fully Sharded Data Parallelism (FSDP) is a PyTorch data-parallel training technique that shards model parameters, gradients, and optimizer states across workers, gathering the full parameters of a layer only while it is being computed.
Fully Sharded Data Parallelism in PyTorch
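A minimal sketch of wrapping a model in PyTorch's FSDP, assuming a distributed launch (e.g. via `torchrun`) with one process per CUDA device:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")          # env provided by torchrun
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(1024, 1024).cuda()
model = FSDP(model)                      # parameters now sharded across ranks

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()                          # gradients reduced and re-sharded
```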
GPT
GPT (Generative Pre-trained Transformer) is a decoder-only transformer pre-trained on large text corpora to predict the next token; the model generates text autoregressively, one token at a time, and is then adapted to downstream tasks.
Improving Language Understanding by Generative Pre-Training
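A minimal sketch of autoregressive generation, assuming the Hugging Face `transformers` package and the public `gpt2` checkpoint:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts the next token given everything so far.
out = generator("The transformer architecture", max_new_tokens=20)
print(out[0]["generated_text"])
```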
MLP
MLP (Multi-Layer Perceptron) is a feedforward neural network that consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer.
A Brief Introduction to Neural Networks
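A minimal sketch of an MLP in PyTorch, input layer to hidden layer to output layer; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(784, 128),  # input layer -> hidden layer
    nn.ReLU(),            # nonlinearity between layers
    nn.Linear(128, 10),   # hidden layer -> output layer
)

logits = mlp(torch.randn(32, 784))  # batch of 32 inputs
print(logits.shape)                 # torch.Size([32, 10])
```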
NLP
NLP (Natural Language Processing) is a field of computer science that focuses on the interaction between computers and humans using natural language.
Natural Language Processing
NN
NN (Neural Network) is a computing system inspired by the biological neural networks that constitute animal brains.
Artificial Neural Network
RAG
RAG (Retrieval-Augmented Generation) is a technique that pairs a retriever, which fetches relevant passages from a knowledge base, with a generator that conditions on the retrieved passages to produce an answer grounded in that evidence.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
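A minimal sketch of the retrieve-then-generate pattern; the embedding function, document store, and prompt here are hypothetical stand-ins rather than any specific library's API:

```python
import numpy as np

docs = ["GGUF is a file format for GGML-based inference.",
        "BLAS specifies low-level linear algebra routines."]

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; a real system would use a trained encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def retrieve(query: str, k: int = 1) -> list:
    q = embed(query)
    scores = [q @ embed(d) for d in docs]        # similarity search
    return [docs[i] for i in np.argsort(scores)[-k:]]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    # A real system would send this prompt to a generator LLM.
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What is GGUF?"))
```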
RoPE
RoPE (Rotary Positional Encoding) encodes a token's absolute position by rotating pairs of query and key dimensions through position-dependent angles; because the rotations compose, the attention scores between tokens end up depending on their relative positions.
Rotary Positional Embedding
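A minimal sketch of rotary encoding applied to a single vector, using the split-halves convention common in open-source implementations; the dimensions and base frequency are illustrative:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)  # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # A 2-D rotation applied independently to each (x1[i], x2[i]) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

q = np.ones(8)
print(rope(q, pos=0))  # position 0: no rotation
print(rope(q, pos=5))
```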
SFT
Shrink and Fine-Tune (SFT) is a distillation method that avoids an explicit distillation loss: parameters are copied from the teacher into a smaller student model, which is then fine-tuned on the original task.
Pre-trained Summarization Distillation
SFT
Supervised Fine-Tuning (SFT) adapts a pre-trained large language model (LLM) to a specific downstream task by continuing training on labeled examples.
Supervised Fine-tuning: customizing LLMs
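A minimal sketch of supervised fine-tuning with TRL's SFTTrainer, assuming the `trl` and `datasets` packages; the model and dataset names are illustrative:

```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # base model to adapt
    train_dataset=dataset,      # labeled prompt/response data
)
trainer.train()
```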
SwiGLU
SwiGLU is a GLU variant that replaces the sigmoid gate with the Swish (SiLU) activation, SwiGLU(x) = Swish(xW) ⊗ xV; it is used in the feed-forward layers of models such as PaLM and LLaMA.
GLU Variants Improve Transformer
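A minimal sketch of a SwiGLU feed-forward layer in PyTorch; the dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.v = nn.Linear(dim, hidden, bias=False)  # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.silu(self.w(x)) * self.v(x)         # Swish(xW) * (xV)

print(SwiGLU(16, 32)(torch.randn(4, 16)).shape)      # torch.Size([4, 32])
```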
TRL
TRL (Transformer Reinforcement Learning) is a library for post-training transformer language models with methods such as supervised fine-tuning (SFT), reward modeling, PPO, and DPO.
TRL - Transformer Reinforcement Learning
ZeRO
ZeRO (Zero Redundancy Optimizer) is a memory optimization technique that partitions optimizer states, gradients, and parameters across data-parallel processes, removing redundant copies so that models whose training state exceeds a single GPU's memory can still be trained.
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
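A minimal sketch of enabling ZeRO stage 3 through DeepSpeed, assuming the `deepspeed` package and a distributed launch (e.g. the `deepspeed` CLI); the model and batch size are illustrative:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 3},  # partition params, grads, optimizer
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```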