Generative AI Roadmap

A subjective learning guide for generative AI research, including a curated list of articles and projects

Generative AI is a hot topic today 🔥 and this roadmap is designed to help beginners quickly gain basic knowledge and find useful resources on generative AI. Experts are also welcome to use this roadmap to refresh old knowledge and develop new ideas.

Table of Contents

Background Knowledge
Large Language Models (LLMs)
Diffusion Models
Large Multimodal Models (LMMs)
Beyond Transformers
- Implicitly Structured Parameters
- New Model Architectures

Background Knowledge

This section should help you learn or regain the basic knowledge of neural networks (e.g., backpropagation), get you familiar with the transformer architecture, and describe some common transformer-based models.

Neural Networks Inference and Training

Are you very familiar with the following classic neural network structures?

📝 If so, you should be able to answer these questions:

Why do CNNs work better than MLPs on images?
Why do RNNs work better than MLPs on time-series data?
What's the difference between GRU and LSTM?

Backpropagation (BP) is the foundation of NN training. You will not be an AI expert if you don't understand BP. There are many textbooks and online tutorials teaching BP, but unfortunately, most of them don't present the formulas in vectorized/tensorized form. In that form, the BP formula of an NN layer is just as neat as its forward pass formula, and this is exactly how BP is, and should be, implemented. To understand BP, please read the following materials:

Neural Networks and Deep Learning [Chapter 3.2 especially 3.2.6]
meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting (ICML 2017) [Section 2.1]
Resprop: Reuse sparsified backpropagation (CVPR 2020) [Section 3.1]
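
To make the vectorized form concrete, here is a minimal NumPy sketch of a dense layer's forward and backward pass. It is an illustration rather than code from the materials above; the function names `dense_forward` and `dense_backward` and the `(batch, features)` layout are assumptions made for this example.

```python
import numpy as np

def dense_forward(x, W, b):
    """Forward pass of a dense layer: y = x @ W + b.

    x: (batch, d_in), W: (d_in, d_out), b: (d_out,)
    """
    return x @ W + b

def dense_backward(x, W, grad_y):
    """Backward pass of the same layer.

    grad_y is dL/dy with shape (batch, d_out). Returns the gradients
    with respect to the input, the weights, and the bias.
    """
    grad_x = grad_y @ W.T        # (batch, d_in), passed to the previous layer
    grad_W = x.T @ grad_y        # (d_in, d_out), used to update W
    grad_b = grad_y.sum(axis=0)  # (d_out,), summed over the batch
    return grad_x, grad_W, grad_b
```

The backward pass is written with the same kind of matrix products as the forward pass, which is a useful starting point for the FLOPs question below.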

📝 If you understand BP, you should be able to answer these questions:

How will you describe the BP of a convolutional layer?
What is the ratio of the computing cost (i.e., number of floating point operations) between forward pass and backward pass of a dense layer?
How will you describe the BP of an MLP with two dense layers sharing the same weight matrix?

Transformer Architecture

Transformer is the base architecture of existing large generative models. It's necessary to understand every component in the transformer. Please read the following materials:

Attention Is All You Need (NeurIPS 2017) [Original Paper]
An image is worth 16x16 words: Transformers for image recognition at scale (ICLR 2021) [Vision Transformer]
Neural machine translation with a Transformer and Keras [Great Explanation for MultiHead Attention (MHA)]
FLOPs of a Transformer Block [Let's practice calculating FLOPs]
Fast Transformer Decoding: One Write-Head is All You Need [Multi-Query Attention (MQA)]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints [Grouped-Query Attention (GQA)]
Enhanced Transformer with Rotary Position Embedding [Understand Positional Embedding]
Rotary Embeddings: A Relative Revolution [Understand Positional Embedding]
Teacher Forcing vs Scheduled Sampling vs Normal Mode [Teacher Forcing in Transformer Training]
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [See Section 3 (generative inference) to learn how LLMs perform generation based on the KV cache]
Contextual Position Encoding: Learning to Count What’s Important [Context-dependent positional encoding]
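
As a companion to the MHA, MQA, and GQA readings above, the following NumPy sketch shows grouped-query attention with a causal mask. It is a simplified illustration, not code from any of the papers: projections, batching, and positional encoding are omitted, and the function names and tensor layout `(heads, tokens, head_dim)` are assumptions of this example.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, causal=True):
    """q: (n_heads, T, d); k, v: (n_kv_heads, T, d).

    Each group of n_heads // n_kv_heads query heads shares one K/V head.
    """
    n_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_heads // n_kv_heads
    # Repeat each K/V head so it lines up with its group of query heads.
    k = np.repeat(k, group, axis=0)                 # (n_heads, T, d)
    v = np.repeat(v, group, axis=0)                 # (n_heads, T, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_heads, T, T)
    if causal:
        # Lower-triangular mask: position i may only attend to positions j <= i.
        mask = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v                      # (n_heads, T, d)
```

With `n_kv_heads == n_heads` this is ordinary multi-head attention, and with `n_kv_heads == 1` it is multi-query attention; only the number of stored K/V heads changes.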

📝 If you understand transformers, you should be able to answer these questions:

What are the pros and cons of transformers compared to RNNs? (attending to all positions simultaneously, training parallelism, complexity)
Can you calculate the FLOPs of GQA? When does it reduce to MHA and MQA?
What is the motivation of MQA and GQA?
What does the causal attention mask look like and why?
How will you describe the training of decoder-only transformers step by step?
Why is RoPE better than sinusoidal positional encoding?

Common Transformer-based Models

Miscellaneous

Einsum is easy and useful [A great tutorial for using einsum/einops]
Open-Endedness is Essential for Artificial Superhuman Intelligence (ICML 2024) [Thoughts on achieving superhuman AI]

Large Language Models (LLMs)

LLMs are transformers. They can be categorized into encoder-only, encoder-decoder, and decoder-only architectures, as shown in the LLM evolutionary tree below [image source]. Check milestone papers of LLMs.

An encoder-only model can be used to extract sentence features but lacks generative power. Encoder-decoder and decoder-only models are used for text generation. In particular, most existing LLMs prefer decoder-only structures due to their stronger representational power. Intuitively, encoder-decoder models can be considered a sparse version of decoder-only models, and information decays more from encoder to decoder. Check this paper for more details.

Pretraining and Finetuning

LLMs are typically pretrained on trillions of text tokens by model publishers to internalize the structure of natural language. Today's model developers also conduct instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to teach the model to follow human instructions and generate answers aligned with human preferences. Users can then download the published model and fine-tune it on small personal datasets (e.g., movie dialog). Due to the huge amount of data, pretraining requires massive computing resources (e.g., thousands of GPUs), which individuals cannot afford. Fine-tuning, on the other hand, is less resource-hungry and can be done with a few GPUs.

The following materials can help you understand the pretraining and fine-tuning process:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [Pretraining and Finetuning of Encoder-only LLMs]
Scaling Instruction-Finetuned Language Models [Pretraining and Instructional Finetuning]
Illustrating Reinforcement Learning from Human Feedback (RLHF)
Language Models are Few-Shot Learners [Decoder-only LLMs] [Chinese-language reading guide by Mu Li (李沐)]

Prompting

Prompting techniques for LLMs involve crafting input text in a way that guides the model to generate the desired responses or outputs. Here are some useful resources to help you write better prompts:

[DAIR.AI] Prompt Engineering Guide
Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model
Awesome Deliberative Prompting - How to ask LLMs to produce reliable reasoning and make reason-responsive decisions
AutoPrompt - An automated method based on gradient-guided search to create prompts for a diverse set of NLP tasks.

Evaluation

Evaluation tools for large language models help assess their performance, capabilities, and limitations across different tasks and datasets. Here are some common evaluation strategies:

Automatic Evaluation Metrics: These metrics assess model performance automatically without human intervention. Common metrics include:
- BLEU: Measures the similarity between generated text and reference text based on n-gram overlap.
- ROUGE: Evaluates text summarization by comparing overlapping n-grams between generated and reference summaries.
- Perplexity: Measures how well a language model predicts a sample of text. Lower perplexity indicates better performance. It is equivalent to the exponentiation of the cross-entropy between the data and model predictions.
- F1 Score: Measures the balance between precision and recall in tasks like text classification or named entity recognition.
Human Evaluation: Human judgment is essential for assessing the quality of generated text comprehensively. Common human evaluation methods include:
- Human Ratings: Human annotators rate generated text based on criteria such as fluency, coherence, relevance, and grammaticality.
- Crowdsourcing Platforms: Platforms like Amazon Mechanical Turk or Figure Eight facilitate large-scale human evaluation by crowdsourcing annotations.
- Expert Evaluation: Domain experts assess model outputs to gauge their suitability for specific applications or tasks.
Benchmark Datasets: Standardized datasets enable fair comparison of models across different tasks and domains. Examples include:
- TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
- HellaSwag: Can a Machine Really Finish Your Sentence?
- GSM8K: Training Verifiers to Solve Math Word Problems
- A complete list can be found here
Model Analysis Tools: Tools for analyzing model behavior and performance include:
- Automated Interpretability - Code for automatically generating, simulating, and scoring explanations of neuron behavior
- LLM Visualization - Visualizing LLMs at a low level.
- Attention Analysis - Analyzing attention maps from BERT transformer.
- Neuron Viewer - Tool for viewing neuron activations and explanations.

A complete list can be found here
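
Of the automatic metrics listed above, perplexity has the most compact definition: the exponential of the average negative log-likelihood per token. The sketch below computes it from the probabilities a model assigned to the tokens that actually occurred. The function name and the plain-list input are assumptions of this example; real evaluation frameworks work from model logits.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (cross-entropy in nats).
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiating the cross-entropy gives the perplexity.
    return math.exp(cross_entropy)
```

For example, a model that assigns probability 0.25 to every observed token has a perplexity of 4, as if it were choosing uniformly among four options at each step.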

Standard evaluation frameworks for existing LLMs include:

lm-evaluation-harness - A framework for few-shot evaluation of language models.
lighteval - a lightweight LLM evaluation suite that Hugging Face has been using internally.
OLMO-eval - a repository for evaluating open language models.
instruct-eval - This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.

Dealing with Long Context

Dealing with long contexts poses a challenge for large language models due to limitations in memory and processing capacity. Existing techniques include:

A complete list can be found here

Efficient Finetuning

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters:

More work can be found in the Hugging Face PEFT paper collection, and it's highly recommended to practice with the Hugging Face PEFT API.
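
To illustrate what "fine-tuning a small number of (extra) parameters" can look like, here is a NumPy sketch of a low-rank adapter in the spirit of LoRA. The text above does not name a specific method, so this choice is an assumption, as are the class name, the initialization scale, and the layer sizes. It is not the Hugging Face PEFT API.

```python
import numpy as np

class LowRankAdapter:
    """Dense layer with a frozen weight W plus a trainable low-rank update A @ B."""

    def __init__(self, W, rank, rng):
        d_in, d_out = W.shape
        self.W = W                                       # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(d_in, rank))  # trainable
        self.B = np.zeros((rank, d_out))                 # trainable, zero-initialized

    def forward(self, x):
        # Because B starts at zero, the adapter initially leaves the
        # pretrained layer's output unchanged.
        return x @ self.W + (x @ self.A) @ self.B

    def num_trainable(self):
        # Only A and B would receive gradient updates during fine-tuning.
        return self.A.size + self.B.size
```

For a 64x64 weight and rank 4, the adapter trains 512 values instead of 4096, which is where the efficiency comes from.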

Model Merging

Model merging refers to merging two or more LLMs trained on different tasks into a single LLM. This technique aims to leverage the strengths and knowledge of different models to create a more robust and capable model. For example, an LLM for code generation and another LLM for math problem solving can be merged so that the merged model is capable of both code generation and math problem solving.

Model merging is intriguing because it can be achieved effectively with very simple and cheap algorithms (e.g., a linear combination of model weights). Here are some representative papers and reading materials:

More papers about model merging can be found here
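
The "linear combination of model weights" mentioned above can be sketched in a few lines. This example assumes all models share the same architecture and parameter names, and represents each model as a plain dict of NumPy arrays; the function name `merge_weights` is hypothetical, and published merging methods add refinements on top of this basic idea.

```python
import numpy as np

def merge_weights(state_dicts, coeffs):
    """Merge models by a weighted sum of their parameters.

    state_dicts: list of {parameter_name: np.ndarray} with identical keys and shapes.
    coeffs: one mixing coefficient per model.
    """
    if len(state_dicts) != len(coeffs):
        raise ValueError("need exactly one coefficient per model")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("models must have the same parameter names")
    # Combine each parameter tensor independently.
    return {
        name: sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
        for name in keys
    }

# Usage: average a (toy) code model and a (toy) math model.
code_model = {"layer.weight": np.array([1.0, 2.0])}
math_model = {"layer.weight": np.array([3.0, 6.0])}
merged = merge_weights([code_model, math_model], [0.5, 0.5])
```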

Efficient Generation

Accelerating the decoding of LLMs is crucial for improving inference speed and efficiency, especially in real-time or latency-sensitive applications. Here are some representative works on speeding up the decoding process of LLMs:

More work about accelerating LLM decoding can be found via Link 1 and Link 2.

Knowledge Editing

Knowledge editing aims to efficiently modify the behavior of LLMs, such as reducing bias and revising learned correlations. It covers many topics, such as knowledge localization and unlearning. Representative work includes:

More papers can be found here.

LLM-powered Agents

Through massive training, LLMs absorb world knowledge and can follow input instructions precisely. With these capabilities, LLMs can act as agents that autonomously (and collaboratively) solve complex tasks or simulate human interactions. Here are some representative papers on LLM agents:

Generative Agents: Interactive Simulacra of Human Behavior (UIST 2023) [LLMs simulate human society in video games]
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents (ICLR 2024) [LLMs simulate social interactions]
Voyager: An Open-Ended Embodied Agent with Large Language Models [LLMs live in the Minecraft world]
Large Language Models as Tool Makers (ICLR 2024) [LLMs create their own reusable tools (e.g., in python functions) for problem-solving]
MetaGPT: Meta Programming for Multi-Agent Collaborative Framework [LLMs as a team for automated software development]
WebArena: A Realistic Web Environment for Building Autonomous Agents (ICLR 2024) [LLMs use web applications]
Mobile-Env: An Evaluation Platform and Benchmark for LLM-GUI Interaction [LLMs use mobile applications]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face (NeurIPS 2023) [LLMs look up models on Hugging Face for problem-solving]
AGENTGYM: Evolving Large Language Model-based Agents across Diverse Environments [Diverse interactive environments and tasks for LLM-based agents]

A complete list of papers, platforms, and evaluation tools can be found here.

Findings

Open Challenges

LLMs face several open challenges that researchers and developers are actively working to address. These challenges include:

Hallucination
- A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
Model Compression
- A Comprehensive Survey of Compression Algorithms for Language Models
Evaluation
- Evaluating Large Language Models: A Comprehensive Survey
Reasoning
- A Survey of Reasoning with Foundation Models
Explainability
- From Understanding to Utilization: A Survey on Explainability for Large Language Models
Fairness
- A Survey on Fairness in Large Language Models
Factuality
- A Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
Knowledge Integration
- Trends in Integration of Knowledge and Large Language Models: A Survey and Taxonomy of Methods, Benchmarks, and Applications

A complete list can be found here.

Diffusion Models

Diffusion models aim to approximate the probability distribution of a given data domain and provide a way to generate samples from the approximated distribution. Their goals are similar to those of other popular generative models, such as VAEs, GANs, and Normalizing Flows.

The workflow of diffusion models consists of two processes:

Forward process (diffusion process): noise is applied to the original input data step by step until the data becomes pure noise.
Reverse process (denoising process): an NN model (e.g., a CNN or transformer) is trained to estimate the noise applied at each step of the forward process. The trained NN model can then be used to generate data from noise input. Existing diffusion models can also accept other signals (e.g., text prompts from users) to condition the data generation.
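
The forward process can be sketched as follows. The text above does not fix a particular formulation, so this example assumes the common DDPM-style Gaussian noising with a linear variance schedule, where a noisy sample at any step can be drawn directly from the clean input. The function names, schedule values, and number of steps are illustrative assumptions.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from the clean data x0 at step t.

    Returns the noisy sample and the noise that was mixed in; the noise
    is what the denoising network is trained to estimate.
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise
```

As `t` grows, `alpha_bar[t]` shrinks toward zero, so the sample keeps less of the original data and approaches pure Gaussian noise.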

Check this awesome blog; more introductory tutorials can be found here. Diffusion models can be used to generate images, audio, videos, and more, and there are many subfields related to diffusion models, as shown below [image source]:

Image Generation

Here are some representative papers of diffusion models for image generation:

More papers can be found here.

Video Generation

Here are some representative papers of diffusion models for video generation:

More papers can be found here.

Audio Generation

Here are some representative papers of diffusion models for audio generation:

More papers can be found here.

Pretraining and Finetuning

Similar to other large generative models, diffusion models are pretrained on large amounts of web data (e.g., the LAION-5B dataset) and consume massive computing resources. Users can download the released weights and further fine-tune the model on personal datasets.

Here are some representative papers of efficient fine-tuning of diffusion models:

More papers can be found here.

It's highly recommended to practice with the Hugging Face Diffusers API.

Evaluation

Here we discuss the evaluation of diffusion models for image generation. Many existing image quality metrics can be applied.

CLIP score: CLIP score measures the compatibility of image-caption pairs. Higher CLIP scores imply higher compatibility. CLIP score was found to have high correlation with human judgement.
Fréchet Inception Distance (FID): FID measures how similar two datasets of images are. It is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network.
CLIP directional similarity: It measures the consistency of the change between the two images (in CLIP space) with the change between the two image captions.

More image quality metrics and calculation tools can be found here.
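
The Fréchet distance between two Gaussians, which FID relies on, has a closed form and can be sketched with NumPy alone. This example assumes the Inception features have already been extracted into arrays of shape `(num_images, feature_dim)`; the function names are illustrative, and production tools add numerical safeguards that are omitted here.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # covariance matrices (up to numerical error, hence real part and clip).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def fid_from_features(feats_a, feats_b):
    """FID-style score from two sets of precomputed feature vectors."""
    # Fit a Gaussian (mean and covariance) to each feature set.
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    return frechet_distance(mu_a, sigma_a, mu_b, sigma_b)
```

Identical feature sets give a score of (numerically) zero, and the score grows as the two fitted Gaussians move apart in mean or covariance.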

Efficient Generation

Diffusion models require multiple forward steps to generate data, which is expensive. Here are some representative papers on efficient generation with diffusion models:

More papers can be found here.

Knowledge Editing

Here are some representative papers of knowledge editing for diffusion models:

More papers can be found here.

Open Challenges

Here are some survey papers discussing the challenges faced by diffusion models.

Large Multimodal Models (LMMs)

Typical LMMs are constructed by connecting and fine-tuning existing pretrained unimodal models. Some are also pretrained from scratch. Check how LMMs evolve in the image below [image source].

Model Architectures

There are many different ways of constructing LMMs. Representative architectures include:

More papers can be found via Link 1 and Link 2.

Towards Embodied Agents

By combining LMMs with robots, researchers aim to develop AI systems that can perceive, reason about, and act upon the world in a more natural and intuitive way, with potential applications spanning robotics, virtual assistants, autonomous vehicles, and beyond. Here are some representative works on realizing embodied AI with LMMs:

More papers can be found via Link 1 and Link 2.

Here are some popular simulators and datasets for evaluating LMM performance in embodied AI:

More resources can be found here.

Open Challenges

Here are some survey papers discussing open challenges for LMM-enabled embodied AI:

Beyond Transformers

Researchers are exploring new models beyond transformers. These efforts include implicitly structuring model parameters and defining new model architectures.

Implicitly Structured Parameters

New Model Architectures

Here is an awesome tutorial for state space models.

pittisl/Generative-AI-Tutorial