How-Diffusion-Models-Work

Notes from How Diffusion Models Work by DeepLearning.ai


Contents

Intuition

Sampling

  • With Extra Noise

Training

Context Embedding

Faster Sampling


Notes

Taught by Sharon Zhou

Noted by Atul

image

  • Example used throughout the course: generating 16×16 sprites for video games.

Intuition

  • Goal: given many sprite images, generate even more novel sprite images

image

  • What does the network learn?

    • Fine details
    • General outline
    • Everything in between
  • Noising Process (illustrated with Bob the sprite dissolving like a drop of ink in water)

image
  • Denoising Process (what should the NN think?)

    • If it's Bob the sprite, keep it as it is
    • If it's likely to be Bob, suggest more details to fill in
    • If it's just the outline of a sprite, suggest general details for a likely sprite (Bob/Fred/...)
    • If it's nothing, suggest the outline of a sprite
  • Give the NN input noise, whose pixels are sampled from a normal distribution, and get a completely new sprite!

Sampling

  • Assume you have a trained NN
  • At each denoising step, it predicts the noise in the current image and subtracts it to get a better image
  • NOTE: at each denoising step (except the last), some fresh random noise is added back in; without it, samples collapse toward a blurry average ("mode collapse")
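
The steps above can be sketched in NumPy. This is a minimal illustration of one denoising step under a DDPM-style schedule; the names `alpha`, `alpha_bar`, and `beta` are assumed schedule arrays, and `pred_noise` stands in for the network's output:

```python
import numpy as np

def denoise_add_noise(x, t, pred_noise, alpha, alpha_bar, beta):
    # Fresh noise, scaled by the schedule; skipped at the final step (t == 1)
    z = np.random.randn(*x.shape) if t > 1 else 0.0
    extra_noise = np.sqrt(beta[t]) * z
    # Remove the portion of noise the network predicted, then rescale the mean
    mean = (x - pred_noise * ((1 - alpha[t]) / np.sqrt(1 - alpha_bar[t]))) / np.sqrt(alpha[t])
    return mean + extra_noise
```

In a full sampler this function would be called in a loop from t = T down to 1, with `pred_noise` coming from the trained UNet at each step.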

Neural Network

  • UNet Architecture
    • Input and output of same size
    • First used for image segmentation

image

  • Takes a noisy image, compresses it into a small embedding space by downsampling, then upsamples to predict the noise

  • Can take more information in the form of embeddings

    • Time embedding: tied to the timestep, and hence the noise level
    • Context embedding: guides the generation process
  • Check out forward() in the sampling notebook

image
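
One way such embeddings can be folded into the upsampling path is to let the context embedding scale the feature map while the time embedding shifts it. A minimal NumPy sketch of that idea (the function name and shapes are illustrative, not the notebook's exact API):

```python
import numpy as np

def inject_embeddings(up_feat, t_emb, c_emb):
    # up_feat: feature map of shape (C, H, W)
    # t_emb, c_emb: per-channel embeddings of shape (C,)
    # Context scales each channel; time embedding shifts it
    return c_emb[:, None, None] * up_feat + t_emb[:, None, None]
```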

Training

Learns the distribution of what is "not noise"

  • Randomly sample a training image, a timestep t, and noise
    • The timestep controls the level of noise added
    • Randomization across images and timesteps makes training stable
  • Add the noise to the image
  • Feed the noised image into the NN, which predicts the noise
  • Compute the loss between the actual and predicted noise
  • Backprop and learn

image
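
The training steps above can be sketched as a single NumPy function; `predict_noise` is a hypothetical stand-in for the UNet, and `alpha_bar` is an assumed cumulative noise-schedule array:

```python
import numpy as np

def training_step(x0, alpha_bar, predict_noise):
    T = len(alpha_bar) - 1
    t = np.random.randint(1, T + 1)                      # random timestep
    noise = np.random.randn(*x0.shape)                   # random noise
    ab = alpha_bar[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise   # noised image
    pred = predict_noise(x_t, t)                         # network's guess
    return np.mean((noise - pred) ** 2)                  # MSE on the noise
```

In the real notebook this loss would be backpropagated through the UNet; here the return value just shows what is being minimized.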

Control

  • Embeddings are vectors; for instance, text strings represented as vectors of numbers
  • Given as input to the NN along with the training image
  • Become associated with a training example and its properties
  • Use: generate funky mixtures by combining embeddings
  • Context formats
    • Text
    • Categories, one-hot encoded (e.g. hero, non-hero, spells ...)

image
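
A minimal sketch of the one-hot context format; the category list here is illustrative, not the course's exact set:

```python
import numpy as np

# Illustrative sprite categories (hypothetical list, not the course's exact one)
CATEGORIES = ["hero", "non-hero", "spell"]

def one_hot_context(category, categories=CATEGORIES):
    # Encode a category name as a one-hot context vector for the NN
    vec = np.zeros(len(categories))
    vec[categories.index(category)] = 1.0
    return vec
```

Mixing such vectors, e.g. `0.6 * one_hot_context("hero") + 0.4 * one_hot_context("spell")`, is one way to ask for the "funky mixtures" mentioned above.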

Fast Sampling: DDIM

  • DDPM is slow!
    • Many timesteps, and each step depends on the previous one (Markovian)
  • DDIM breaks the Markov assumption: it is deterministic and can skip timesteps
  • Quality is lower than DDPM's, but sampling is much faster
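
One DDIM step can be sketched as: estimate the clean image from the predicted noise, then jump directly to an earlier timestep without injecting fresh noise. A hedged NumPy illustration (`ab_t` and `ab_prev` are assumed cumulative-product noise levels at the current and target timesteps):

```python
import numpy as np

def ddim_step(x_t, pred_noise, ab_t, ab_prev):
    # Estimate the fully denoised image from the predicted noise
    x0_est = (x_t - np.sqrt(1.0 - ab_t) * pred_noise) / np.sqrt(ab_t)
    # Jump directly to the (possibly much earlier) target timestep;
    # no fresh noise is injected, so the step is deterministic
    return np.sqrt(ab_prev) * x0_est + np.sqrt(1.0 - ab_prev) * pred_noise
```

Because `ab_prev` need not be the immediately preceding timestep, a sampler built on this step can cover the schedule in far fewer iterations than DDPM.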

Summary

Other applications: music generation, inpainting, textual inversion