This was a project submitted to the University of Queensland for the course COMP3710.
Simple diffusion-based image generation using PyTorch. This model can learn from a dataset of images and generate new images that are perceptually similar to those in the dataset.
Huge thanks to these videos for helping my understanding:
- Diffusion models from scratch in PyTorch
  - This repo was built largely by referencing code in the colab notebook from this video. Quite a few changes were made to improve performance.
- Diffusion Models | Paper Explanation | Math Explained
Diffusion papers:
- `train.py` - Command line utility that trains a new diffusion model on a dataset.
- `dataset.py` - Wraps a directory of image files in a PyTorch dataloader. Images can be any size or format that can be opened by PIL. All images are resized to a given dimension, converted to RGB and normalised to a range of -1 to 1 (a minimal sketch of this is shown after this list).
- `modules.py` - Contains a Trainer class to handle training of the model. Contains the U-Net model and required components.
- `predict.py` - Command line utility to generate new images from an existing `.pth` model.
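To make the preprocessing concrete, here is a minimal sketch of what a dataset wrapper like this might look like. The class name and details are illustrative assumptions, not the repo's actual code:

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class ImageFolderDataset(Dataset):
    """Hypothetical sketch: wraps a flat directory of images for training."""
    def __init__(self, path, image_size=64):
        self.files = [os.path.join(path, f) for f in os.listdir(path)]
        self.transform = transforms.Compose([
            transforms.Resize((image_size, image_size)),  # resize to a fixed dimension
            transforms.ToTensor(),                        # PIL image -> tensor in [0, 1]
            transforms.Lambda(lambda t: t * 2 - 1),       # normalise to [-1, 1]
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        img = Image.open(self.files[idx]).convert("RGB")  # any PIL-readable format
        return self.transform(img)

# Example usage (assumes an `images/` folder in the working directory):
# loader = DataLoader(ImageFolderDataset("images"), batch_size=64, shuffle=True)
```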
- A system (preferably linux) with either Anaconda or Miniconda installed.
- A GPU with at least 12GB of memory if you plan to train models.
- Clone this branch and cd to the `recognition/45802492_SimpleDiffusion/` folder.
- Setup a new conda environment. An `environment.yml` file is supplied to do this automatically: run `conda env create -f environment.yml`, then `conda activate diff`.
- Create a folder with training images in the local directory (eg. `PatternFlow/recognition/images`). There are no requirements on image size or naming. All images within this folder will be resized and used to train the model.
- Run the training script: `python train.py name path`, which will start training. Every epoch a test image will be generated and saved to `./out`, and a denoising timestep plot will be saved to `./plot`.
- Tensorboard is also supported, and training logs are saved to `./runs`. You can launch tensorboard using `tensorboard --logdir ./` to view loss metrics during training.
- Once training has finished, the model will be saved as `name.pth` in the local directory. Additionally, an `autosave.pth` file is created every epoch.
Parameters for train.py
| Parameter | Short | Required | Default | Description |
|---|---|---|---|---|
| name | | required | | Name of model |
| path | | required | | Path to dataset folder |
| --timesteps | -t | optional | 1000 | Number of diffusion timesteps in betas schedule |
| --epochs | -e | optional | 100 | Number of epochs to train for |
| --batch_size | -b | optional | 64 | Training batch size |
| --image_size | -i | optional | 64 | Image dimension. All images are resized to size x size |
| --beta_schedule | -s | optional | linear | Beta schedule type. Options: 'linear', 'cosine', 'quadratic' and 'sigmoid' (sketched below) |
| --disable_images | | optional | | Disables saving images and plots every epoch |
| --disable_tensorboard | | optional | | Disables tensorboard for training |
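For context on `--beta_schedule`: the four options are standard ways of spacing the noise variances across the timesteps. The sketch below follows the conventions popularised by the DDPM and improved-DDPM papers; the exact constants used in this repo may differ:

```python
import math
import torch

def linear_beta_schedule(timesteps, start=1e-4, end=0.02):
    # Evenly spaced betas (the DDPM paper's default)
    return torch.linspace(start, end, timesteps)

def quadratic_beta_schedule(timesteps, start=1e-4, end=0.02):
    # Linear in sqrt-space, so betas grow quadratically
    return torch.linspace(start**0.5, end**0.5, timesteps) ** 2

def sigmoid_beta_schedule(timesteps, start=1e-4, end=0.02):
    # S-shaped ramp between start and end
    return torch.sigmoid(torch.linspace(-6, 6, timesteps)) * (end - start) + start

def cosine_beta_schedule(timesteps, s=0.008):
    # Improved DDPM (Nichol & Dhariwal 2021): derive betas from a cosine
    # cumulative-alpha curve, which adds noise more gently at both ends
    x = torch.linspace(0, timesteps, timesteps + 1)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clamp(betas, 0.0001, 0.9999)
```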
- Run the predict script: `python predict.py model`
- A random image will be generated using the supplied model and saved.
Parameters for predict.py
| Parameter | Short | Required | Default | Description |
|---|---|---|---|---|
| model | | required | | Path to .pth model file |
| --output | -o | optional | ./ | Output path to save images |
| --name | -n | optional | predict | Name prefix to use for generated images |
| --num_images | -i | optional | 1 | Number of images to create |
Some pretrained models are supplied in the examples section below.
Diffusion image generation is described in these papers: 1, 2. They work by defining a Markov chain in which Gaussian noise is successively added to an image over a defined number of timesteps. This is called the forward diffusion process. The reverse diffusion process is the opposite: given an image at a certain timestep, the noise added at that step is estimated and removed to recover the image at the previous timestep.
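A useful property of the forward process is that the noisy image at any timestep can be sampled in closed form from the original image, rather than adding noise step by step. A minimal sketch in the usual DDPM notation, where `alphas_cumprod` is the cumulative product of (1 - beta):

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)            # any beta schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # "alpha bar" in the papers

def q_sample(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0) directly for a batch of timesteps t."""
    noise = torch.randn_like(x0)
    sqrt_ab = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise, noise
```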
A U-Net neural network is then trained to predict the noise in an image at a given timestep. To do this, the timestep is encoded (commonly with a sinusoidal position embedding, as in the referenced colab) and passed to the network alongside the noisy image.
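A sketch of such an embedding, and of the training objective, which is a simple MSE between the true and predicted noise (the repo's exact implementation may differ):

```python
import math
import torch
import torch.nn.functional as F

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps, as in Transformers/DDPM."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)  # shape (batch, dim)

# Hypothetical training step, reusing q_sample from above:
# t = torch.randint(0, T, (x0.shape[0],))       # random timestep per image
# x_t, noise = q_sample(x0, t, alphas_cumprod)  # noised images + true noise
# loss = F.mse_loss(model(x_t, t), noise)       # U-Net regresses the noise
```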
Once the U-Net has been trained, denoising can be performed from a random point in latent space (usually an image of pure Gaussian noise) by repeatedly subtracting the predicted noise over the entire reverse timestep range. This results in a new image that is perceptually similar to those in the training dataset.
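As a sketch of that denoising loop (DDPM ancestral sampling, Algorithm 2 in Ho et al.; assumes a `model(x, t)` that returns the predicted noise, as above):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Generate images by iterating the reverse diffusion process."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        pred_noise = model(x, t_batch)
        # Remove the predicted noise to estimate the mean of x_{t-1}
        x = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * pred_noise) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # stochastic sampling noise
    return x
```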
This project uses a simplified U-Net design, omitting some of the features described in the papers above.
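As an illustration only (the actual components live in `modules.py` and will differ in detail), one building block of such a simplified U-Net might mix the timestep embedding into each convolutional stage like this:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Hypothetical U-Net stage: two convs with the timestep embedding mixed in."""
    def __init__(self, in_ch, out_ch, time_dim):
        super().__init__()
        self.time_mlp = nn.Linear(time_dim, out_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # broadcast over H and W
        return self.act(self.conv2(h))
```

Stacking such blocks with downsampling, then upsampling with skip connections, gives the familiar U-Net shape.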
Using part of the AKOA Knee dataset, consisting of 18,681 MRI images. Image size 128x128, batch size 64, 1000 timesteps, 100 epochs. Download the pretrained model.
Epoch 0 Epoch 10 Epoch 20 Epoch 99
Using the OASIS Brain dataset with 11,329 images. Image size 128x128, 1000 timesteps, batch size 32, 100 epochs. Notice the artifacts due to the small batch size. Download the pre-trained model.
Epoch 0 Epoch 10 Epoch 20 Epoch 99
Just for fun, the model was also trained on the CelebA dataset (aligned and cropped), consisting of around 200,000 images. Image size 128x128, batch size 64, 1000 timesteps, 100 epochs. Download the pre-trained model. The network does well with the faces but struggles to generate hair and backgrounds.