MIT License

Generative AI Roadmap

A subjective learning guide for generative AI research, including a curated list of articles and projects.

Generative AI is a hot topic today 🔥, and this roadmap is designed to help beginners quickly pick up the basics and find useful resources on generative AI. Experts are also welcome to use this roadmap to refresh old knowledge and develop new ideas.

Background Knowledge

This section should help you learn or regain the basic knowledge of neural networks (e.g., backpropagation), get you familiar with the transformer architecture, and describe some common transformer-based models.

Neural Networks Inference and Training

Are you very familiar with the following classic neural network structures?

📝 If so, you should be able to answer these questions:

  • Why do CNNs work better than MLPs on images?
  • Why do RNNs work better than MLPs on time-series data?
  • What's the difference between GRU and LSTM?

Backpropagation (BP) is the foundation of NN training, and you cannot become an AI expert without understanding it. Many textbooks and online tutorials teach BP, but unfortunately most of them do not present the formulas in vectorized/tensorized form. The BP formula of an NN layer is in fact as neat as its forward-pass formula, and this vectorized form is exactly how BP is, and should be, implemented. To understand BP, please read the following materials:

📝 If you understand BP, you should be able to answer these questions:

  • How will you describe the BP of a convolutional layer?
  • What is the ratio of the computing cost (i.e., number of floating point operations) between forward pass and backward pass of a dense layer?
  • How will you describe the BP of an MLP with two dense layers sharing the same weight matrix?
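
To make the point about vectorized BP concrete, here is a minimal NumPy sketch of a dense layer's forward and backward passes. The function names (`dense_forward`, `dense_backward`, `numerical_grad_w`) and the row-major batch layout are assumptions made for illustration; they are not taken from the reading materials.

```python
import numpy as np


def dense_forward(x, w, b):
    """Forward pass of a dense layer: y = x @ w + b.

    x: (batch, d_in), w: (d_in, d_out), b: (d_out,)
    """
    return x @ w + b


def dense_backward(x, w, grad_y):
    """Backward pass given grad_y = dL/dy, which has the same shape as y."""
    grad_x = grad_y @ w.T        # dL/dx, shape (batch, d_in)
    grad_w = x.T @ grad_y        # dL/dw, shape (d_in, d_out)
    grad_b = grad_y.sum(axis=0)  # dL/db, shape (d_out,)
    return grad_x, grad_w, grad_b


def numerical_grad_w(x, w, b, grad_y, eps=1e-6):
    """Finite-difference estimate of dL/dw for the loss L = sum(y * grad_y)."""
    num = np.zeros_like(w)
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[i, j] += eps
            w_minus[i, j] -= eps
            loss_plus = np.sum(dense_forward(x, w_plus, b) * grad_y)
            loss_minus = np.sum(dense_forward(x, w_minus, b) * grad_y)
            num[i, j] = (loss_plus - loss_minus) / (2 * eps)
    return num
```

Note that the forward pass uses one matrix product while the backward pass uses two of the same size, which is a useful starting point for the computing-cost question above.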

Transformer Architecture

The transformer is the base architecture of existing large generative models. It is necessary to understand every component of the transformer. Please read the following materials:

📝 If you understand transformers, you should be able to answer these questions:

  • What are the pros and cons of transformers compared to RNNs? (Hints: simultaneous attention, training parallelism, complexity.)
  • Can you calculate the FLOPs of GQA? When does it reduce to MHA, and when to MQA?
  • What is the motivation of MQA and GQA?
  • What does the causal attention mask look like and why?
  • How will you describe the training of decoder-only transformers step by step?
  • Why is RoPE better than sinusoidal positional encoding?
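
As a starting point for the GQA and causal-mask questions, the following NumPy sketch implements grouped-query attention for a single sequence. It is an illustrative sketch under assumptions: the function name `grouped_query_attention`, the `(heads, seq, dim)` tensor layout, and the omission of the input/output projections are choices made here, not something prescribed by the reading materials.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v, causal=True):
    """Attention where several query heads share one key/value head.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    n_q_heads must be a multiple of n_kv_heads.
    n_kv_heads == n_q_heads gives MHA; n_kv_heads == 1 gives MQA.
    """
    n_q_heads, seq_len, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    # Each K/V head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, seq, seq)
    if causal:
        # True above the diagonal: position i must not attend to j > i.
        mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v
```

With the causal mask in place, the output at position i depends only on positions 0..i, so perturbing the last token's keys and values leaves all earlier outputs unchanged.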

Common Transformer-based Models

Miscellaneous

Large Language Models (LLMs)

LLMs are transformers. They can be categorized into encoder-only, encoder-decoder, and decoder-only architectures, as shown in the LLM evolutionary tree below [image source]. Check the milestone papers of LLMs.

LLM Evolutionary Tree

Encoder-only models can be used to extract sentence features but lack generative power. Encoder-decoder and decoder-only models are used for text generation. In particular, most existing LLMs prefer decoder-only structures due to their stronger representational power. Intuitively, encoder-decoder models can be considered a sparse version of decoder-only models, and information decays more on its way from encoder to decoder. Check this paper for more details.

Pretraining and Finetuning

LLMs are typically pretrained on trillions of text tokens by model publishers to internalize the structure of natural language. Today's model developers also conduct instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) to teach the model to follow human instructions and generate answers aligned with human preferences. Users can then download the published model and fine-tune it on small personal datasets (e.g., movie dialog). Because of the huge amount of data, pretraining requires massive computing resources (e.g., thousands of GPUs), which individuals cannot afford. Fine-tuning, on the other hand, is less resource-hungry and can be done with a few GPUs.

The following materials can help you understand the pretraining and fine-tuning process:

More tutorials can be found here.

Prompting

Prompting techniques for LLMs involve crafting the input text so that it guides the model toward the desired responses or outputs. Here are some useful resources to help you write better prompts:

Evaluation

Evaluation tools for large language models help assess their performance, capabilities, and limitations across different tasks and datasets. Here are some common evaluation strategies:

  • Automatic Evaluation Metrics: These metrics assess model performance automatically without human intervention. Common metrics include:

    • BLEU: Measures the similarity between generated text and reference text based on n-gram overlap.
    • ROUGE: Evaluates text summarization by comparing overlapping n-grams between generated and reference summaries.
    • Perplexity: Measures how well a language model predicts a sample of text. Lower perplexity indicates better performance. It equals the exponential of the cross-entropy between the data and the model's predictions.
    • F1 Score: Measures the balance between precision and recall in tasks like text classification or named entity recognition.
  • Human Evaluation: Human judgment is essential for assessing the quality of generated text comprehensively. Common human evaluation methods include:

    • Human Ratings: Human annotators rate generated text based on criteria such as fluency, coherence, relevance, and grammaticality.
    • Crowdsourcing Platforms: Platforms like Amazon Mechanical Turk or Figure Eight facilitate large-scale human evaluation by crowdsourcing annotations.
    • Expert Evaluation: Domain experts assess model outputs to gauge their suitability for specific applications or tasks.
  • Benchmark Datasets: Standardized datasets enable fair comparison of models across different tasks and domains. Examples include:

  • Model Analysis Tools: Tools for analyzing model behavior and performance include:

A complete list can be found here.

Standard evaluation frameworks for existing LLMs include:

  • lm-evaluation-harness - A framework for few-shot evaluation of language models.
  • lighteval - A lightweight LLM evaluation suite that Hugging Face has been using internally.
  • OLMO-eval - a repository for evaluating open language models.
  • instruct-eval - This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
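
To illustrate the perplexity metric described above, the sketch below computes it as the exponential of the mean cross-entropy over target tokens. The function names and the `(num_tokens, vocab_size)` logits layout are assumptions for illustration; the evaluation frameworks listed above have their own interfaces.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean negative log-likelihood (in nats) of the target token ids.

    logits: (num_tokens, vocab_size); targets: (num_tokens,) integer ids.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def perplexity(logits, targets):
    """Perplexity = exp(mean cross-entropy)."""
    return float(np.exp(cross_entropy(logits, targets)))
```

A model that assigns uniform probability over a vocabulary of size V has perplexity V, which gives an intuitive reading of the metric as an "effective number of choices" per token.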

Dealing with Long Context

Dealing with long contexts poses a challenge for large language models due to limitations in memory and processing capacity. Existing techniques include:

A complete list can be found here.

Efficient Finetuning

Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters:

More work can be found in the Hugging Face PEFT paper collection, and it is highly recommended to practice with the Hugging Face PEFT API.
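
As one illustration of training only a small number of extra parameters, the sketch below follows the idea of low-rank adaptation (LoRA), a widely used PEFT method that is not named in the text above and is chosen here as an assumption. The class `LoRALinear` and its arguments are hypothetical and written in plain NumPy; this is not the Hugging Face PEFT API.

```python
import numpy as np


class LoRALinear:
    """A frozen dense layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight                                # frozen, pretrained
        self.a = rng.normal(scale=0.01, size=(d_in, rank))  # trainable
        self.b = np.zeros((rank, d_out))                    # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction x @ a @ b.
        return x @ self.weight + self.scale * (x @ self.a) @ self.b

    def num_trainable(self):
        return self.a.size + self.b.size
```

Because `b` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and for a 64x64 weight with rank 4 only 512 of the 4096 base-sized parameters are trained.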

Model Merging

Model merging refers to merging two or more LLMs trained on different tasks into a single LLM. This technique aims to leverage the strengths and knowledge of different models to create a more robust and capable model. For example, an LLM for code generation and another LLM for math problem solving can be merged so that the merged model is capable of both code generation and math problem solving.

Model merging is intriguing because it can be achieved effectively with very simple and cheap algorithms (e.g., a linear combination of model weights). Here are some representative papers and reading materials:

More papers about model merging can be found here.
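
The linear combination of model weights mentioned above can be sketched in a few lines. This is a minimal illustration that assumes the models share the same architecture and are stored as dictionaries of NumPy arrays; the function name `merge_linear` and the example parameter names are hypothetical.

```python
import numpy as np


def merge_linear(state_dicts, coeffs):
    """Weighted sum of parameter dicts that share the same keys and shapes."""
    assert len(state_dicts) == len(coeffs)
    keys = state_dicts[0].keys()
    assert all(sd.keys() == keys for sd in state_dicts)
    return {
        name: sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
        for name in keys
    }


# Hypothetical example: average a "code" model and a "math" model.
code_model = {"layer.weight": np.array([[1.0, 2.0], [3.0, 4.0]])}
math_model = {"layer.weight": np.array([[3.0, 2.0], [1.0, 0.0]])}
merged = merge_linear([code_model, math_model], [0.5, 0.5])
```

The coefficients are free parameters; equal weights give a plain average, while other choices bias the merged model toward one of its parents.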

Efficient Generation

Accelerating LLM decoding is crucial for improving inference speed and efficiency, especially in real-time or latency-sensitive applications. Here are some representative works on speeding up the decoding process of LLMs:

More work about accelerating LLM decoding can be found via Link 1 and Link 2.

Knowledge Editing

Knowledge editing aims to efficiently modify LLM behaviors, such as reducing bias and revising learned correlations. It includes many topics such as knowledge localization and unlearning. Representative work includes:

More papers can be found here.

LLM-powered Agents

Through massive training, LLMs absorb world knowledge and learn to follow input instructions precisely. With these capabilities, LLMs can act as agents that autonomously (and collaboratively) solve complex tasks or simulate human interactions. Here are some representative papers on LLM agents:

A complete list of papers, platforms, and evaluation tools can be found here.

Findings

Open Challenges

LLMs face several open challenges that researchers and developers are actively working to address. These challenges include:

A complete list can be found here.

Diffusion Models

Diffusion models aim to approximate the probability distribution of a given data domain and provide a way to generate samples from the approximated distribution. Their goals are similar to those of other popular generative models, such as VAEs, GANs, and normalizing flows.

The workflow of diffusion models consists of two processes:

  1. Forward process (diffusion process): noise is applied to the original input data step by step until the data becomes pure noise.
  2. Reverse process (denoising process): an NN model (e.g., a CNN or transformer) is trained to estimate the noise applied at each step of the forward process. The trained model can then be used to generate data from noise input. Existing diffusion models can also accept other signals (e.g., text prompts from users) to condition the data generation.
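
The two processes above can be sketched as follows. This is a minimal illustration that assumes a DDPM-style formulation with a linear noise schedule and the closed-form forward sample; the schedule values and the function names (`make_schedule`, `forward_noise`, `training_loss`) are assumptions, and `predict_noise` stands in for the trained NN model.

```python
import numpy as np


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule; alpha_bars[t] is the cumulative signal level."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot; returns (x_t, eps)."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


def training_loss(predict_noise, x0, t, alpha_bars, rng):
    """Mean squared error between the true and the predicted noise."""
    x_t, eps = forward_noise(x0, t, alpha_bars, rng)
    return float(np.mean((predict_noise(x_t, t) - eps) ** 2))
```

As `t` grows, `alpha_bars[t]` shrinks toward zero, so `x_t` carries less and less of the original data, which matches the description of the forward process ending in pure noise.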

Check this awesome blog; more introductory tutorials can be found here. Diffusion models can be used to generate images, audio, videos, and more, and there are many subfields related to diffusion models, as shown below [image source]:

Diffusion Model Taxonomy

Image Generation

Here are some representative papers of diffusion models for image generation:

More papers can be found here.

Video Generation

Here are some representative papers of diffusion models for video generation:

More papers can be found here.

Audio Generation

Here are some representative papers of diffusion models for audio generation:

More papers can be found here.

Pretraining and Finetuning

Similar to other large generative models, diffusion models are pretrained on large amounts of web data (e.g., the LAION-5B dataset) and consume massive computing resources. Users can download the released weights and further fine-tune the model on personal datasets.

Here are some representative papers of efficient fine-tuning of diffusion models:

More papers can be found here.

It's highly recommended to practice with the Hugging Face Diffusers API.

Evaluation

Here we talk about evaluation of diffusion models for image generation. Many existing image quality metrics can be applied.

  • CLIP score: CLIP score measures the compatibility of image-caption pairs. Higher CLIP scores imply higher compatibility. CLIP score was found to have high correlation with human judgement.
  • Fréchet Inception Distance (FID): FID measures how similar two datasets of images are. It is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network.
  • CLIP directional similarity: It measures the consistency of the change between the two images (in CLIP space) with the change between the two image captions.
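
To make the FID definition above concrete, the sketch below computes the Fréchet distance between two Gaussians fitted to feature arrays. It assumes the Inception features have already been extracted (feature extraction is not shown) and that the feature dimension is at least 2; the function name `frechet_distance` is hypothetical, and established tools should be preferred for reported numbers.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: (num_images, feature_dim) arrays of precomputed features.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory;
    # clipping guards against small negative values from rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of (numerically) zero, and shifting every feature by a constant vector increases the distance by the squared norm of that shift.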

More image quality metrics and calculation tools can be found here.

Efficient Generation

Diffusion models require multiple forward passes to generate data, which is expensive. Here are some representative papers on efficient generation with diffusion models:

More papers can be found here.

Knowledge Editing

Here are some representative papers of knowledge editing for diffusion models:

More papers can be found here.

Open Challenges

Here are some survey papers talking about the challenges faced by diffusion models.

Large Multimodal Models (LMMs)

Typical LMMs are constructed by connecting and fine-tuning existing pretrained unimodal models. Some are also pretrained from scratch. Check how LMMs evolve in the image below [image source].

LMM Evolution

Model Architectures

There are many different ways of constructing LMMs. Representative architectures include:

More papers can be found via Link 1 and Link 2.

Towards Embodied Agents

By combining LMMs with robots, researchers aim to develop AI systems that can perceive, reason about, and act upon the world in a more natural and intuitive way, with potential applications spanning robotics, virtual assistants, autonomous vehicles, and beyond. Here are some representative works on realizing embodied AI with LMMs:

More papers can be found via Link 1 and Link 2.

Here are some popular simulators and datasets for evaluating LMM performance in embodied AI:

More resources can be found here.

Open Challenges

Here are some survey papers talking about open challenges for LMM-enabled embodied AI:

Beyond Transformers

Researchers are trying to explore new models other than transformers. The efforts include implicitly structuring model parameters and defining new model architectures.

Implicitly Structured Parameters

New Model Architectures

Here is an awesome tutorial for state space models.