CLIP Diffusion Art
Fine-tune diffusion models on custom datasets and sample with text conditioning using CLIP guidance, with SwinIR for super-resolution.
📌 Dataset with public domain artworks created for this project: https://www.kaggle.com/sreevishnudamodaran/artworks-in-public-domain
📌 Link to interactive run in notebook: Stunning Art with CLIP Guided Diffusion+SwinIR
📌 Wandb logging is integrated for training and sampling.
Generated Samples
Prompts used for the samples shown here:
- "vibrant watercolor painting of a flower, artstation HQ"
- "beautiful matte painting of dystopian city, Behance HD"
- "artstation HQ, photorealistic depiction of an alien city"
For more generated artworks, visit this report.
Super-resolution Results
Credits
Developed using techniques and architectures borrowed from the original work of the authors below:
- Original notebook on CLIP guidance sampling by Katherine Crowson (https://github.com/crowsonkb, https://twitter.com/RiversHaveWings), with improvements by nerdyrodent and sadnow (@sadly_existent)
- SwinIR: Image Restoration Using Swin Transformer, from https://github.com/JingyunLiang/SwinIR
Huge thanks to all of them for their great work! I highly recommend checking out these repos.
Installation
git clone https://github.com/sreevishnu-damodaran/clip-diffusion-art.git -q
cd clip-diffusion-art
pip install -e . -q
git clone https://github.com/JingyunLiang/SwinIR.git -q
git clone https://github.com/crowsonkb/guided-diffusion -q
pip install -e guided-diffusion -q
git clone https://github.com/openai/CLIP -q
pip install -e ./CLIP -q
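To confirm the environment is set up, a quick import check such as the sketch below can help (a minimal sanity check; clip and guided_diffusion come from the repos cloned above, and the clip_diffusion_art module name is assumed from this repository):
# Sanity check: verify the editable installs are importable and CUDA is visible.
import torch
import clip                  # OpenAI CLIP
import guided_diffusion      # crowsonkb/guided-diffusion
import clip_diffusion_art    # this repository (module name assumed)

print("CUDA available:", torch.cuda.is_available())
print("CLIP models:", clip.available_models())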
Dataset
Public Domain Artworks dataset used in this repo:
https://www.kaggle.com/sreevishnudamodaran/artworks-in-public-domain
Additional details can be found in datasets/README.md.
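If you have Kaggle API credentials configured, the dataset can also be downloaded programmatically. A minimal sketch using the official kaggle package (the target directory is only an example):
# Download the public-domain artworks dataset via the Kaggle API
# (requires ~/.kaggle/kaggle.json; the target path below is an example).
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_files(
    "sreevishnudamodaran/artworks-in-public-domain",
    path="datasets/artworks-in-public-domain",
    unzip=True,
)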
Training & Fine-tuning
Choose the hyperparameters for training. These are reasonable defaults for fine-tuning on a custom dataset with a 16GB GPU on Colab or Kaggle:
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 2 --num_heads 1 --attention_resolutions 16"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear --learn_sigma True --rescale_learned_sigmas True --rescale_timesteps True --use_scale_shift_norm False"
TRAIN_FLAGS="--lr 5e-6 --save_interval 500 --batch_size 16 --use_fp16 True --wandb_project diffusion-art-train --use_checkpoint True --resume_checkpoint pretrained_models/lsun_uncond_100M_1200K_bs128.pt"
Once the hyperparameters are set, run the training job as follows:
python clip_diffusion_art/train.py --data_dir path/to/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
Refer to the OpenAI improved-diffusion repository for more details on choosing hyperparameters and for other pre-trained weights.
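The --data_dir directory should contain the training images. If your source images vary widely in size, a small preprocessing pass like the sketch below (Pillow; the paths are placeholders) can center-crop and resize them to the 256x256 resolution set in MODEL_FLAGS:
# Center-crop and resize raw images to 256x256 before training.
# Paths are placeholders; adjust them to your dataset layout.
from pathlib import Path
from PIL import Image

src = Path("raw_images")       # placeholder: original images
dst = Path("path/to/images")   # placeholder: directory passed to --data_dir
dst.mkdir(parents=True, exist_ok=True)

for i, p in enumerate(sorted(src.glob("*"))):
    try:
        img = Image.open(p).convert("RGB")
    except OSError:
        continue  # skip files Pillow cannot read
    side = min(img.size)
    left, top = (img.width - side) // 2, (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((256, 256), Image.LANCZOS)
    img.save(dst / f"{i:06d}.jpg", quality=95)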
Download SR pre-trained weights
wget https://github.com/JingyunLiang/SwinIR/releases/download/v0.0/003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth
Passing the --sr_model_path flag to sample.py performs super-resolution on each image after sampling.
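The sample command below loads the weights from a pretrained_models/ directory, so it can help to stage the file there. A minimal Python sketch (the directory name simply mirrors the command below):
# Stage the SwinIR weights under pretrained_models/ to match the sample command.
import os
import urllib.request

url = ("https://github.com/JingyunLiang/SwinIR/releases/download/v0.0/"
       "003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth")
os.makedirs("pretrained_models", exist_ok=True)
urllib.request.urlretrieve(url, os.path.join("pretrained_models", os.path.basename(url)))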
Sample Images with CLIP Guidance
python clip_diffusion_art/sample.py \
"beautiful matte painting of dystopian city, Behance HD" \
--checkpoint 256x256_clip_diffusion_art.pt \
--model_config "clip_diffusion_art/configs/256x256_clip_diffusion_art.yaml" \
--sampling "ddim50" \
--cutn 60 \
--cutn_batches 4 \
--sr_model_path pretrained_models/003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth \
--large_sr \
--output_dir "outputs"
Options:
--images - image prompts (default=None)
--checkpoint - diffusion model checkpoint to use for sampling
--model_config - diffusion model config yaml
--wandb_project - enable wandb logging and use this project name
--wandb_name - optional run name to use for wandb logging
--wandb_entity - optional entity to use for wandb logging
--num_samples - number of samples to generate (default=1)
--batch_size - batch size for the diffusion model (default=1)
--sampling - timestep respacing sampling method to use (default="ddim50", choices=[25, 50, 100, 150, 250, 500, 1000, ddim25, ddim50, ddim100, ddim150, ddim250, ddim500, ddim1000])
--diffusion_steps - number of diffusion timesteps (default=1000)
--skip_timesteps - diffusion timesteps to skip (default=5)
--clip_denoised - enable to filter out noise from generation (default=False)
--randomize_class_disable - disables changing the ImageNet class randomly in each iteration (default=False)
--eta - the amount of noise to add during sampling (default=0)
--clip_model - CLIP pre-trained model to use (default="ViT-B/16", choices=["RN50","RN101","RN50x4","RN50x16","RN50x64","ViT-B/32","ViT-B/16","ViT-L/14"])
--skip_augs - enable to skip torchvision augmentations (default=False)
--cutn - the number of random crops to use (default=16)
--cutn_batches - number of batches of crops to take from the image (default=4)
--init_image - initial image to use while sampling (default=None)
--loss_fn - loss function to use for CLIP guidance (default="spherical", choices=["spherical", "cos_spherical"]); see the sketch after this list
--clip_guidance_scale - CLIP guidance scale (default=5000)
--tv_scale - controls smoothing in samples (default=100)
--range_scale - controls the range of RGB values in samples (default=150)
--saturation_scale - controls the saturation in samples (default=0)
--init_scale - controls the adherence to the init image (default=1000)
--scale_multiplier - scales clip_guidance_scale, tv_scale, and range_scale (default=50)
--disable_grad_clamp - disable gradient clamping (default=False)
--sr_model_path - SwinIR super-resolution model checkpoint (default=None)
--large_sr - enable to use the large SwinIR super-resolution model (default=False)
--output_dir - output images directory (default="output_dir")
--seed - the random seed (default=47)
--device - the device to use
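For reference, the "spherical" loss selected by --loss_fn is commonly implemented as the squared spherical (great-circle) distance between normalized CLIP embeddings. A sketch of that standard formulation (illustrative only; not necessarily this repo's exact code):
# Spherical distance between CLIP image and text embeddings (illustrative sketch).
import torch.nn.functional as F

def spherical_dist_loss(x, y):
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    return (x - y).norm(dim=-1).div(2).arcsin().pow(2).mul(4)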
Apply Super-resolution
Use the following to run super-resolution on other images, or use it for other restoration tasks (grayscale/color image denoising, JPEG compression artifact reduction):
python swinir.py <path-to-images-dir> --task "real_sr"
data_dir - directory with images
--task - image restoration task (default='real_sr', choices=['real_sr', 'color_dn', 'gray_dn', 'jpeg_car'])