Fine-tuning Pretrained CLIP with DAMSM and Contrastive Loss for Text-to-Image Synthesis

1. Methodology

A neural network for text-to-image generation is composed of two sub-networks:

Text Encoder and Generator Network

Training a text-to-image generator therefore requires two steps.

  1. The Image Encoder and Text Encoder are jointly pretrained on image-caption pairs, projecting images and text into a common embedding space.
  2. After the text encoder is pretrained, the Generator Network is adversarially trained to generate realistic images conditioned on the text features.

Recent research proposed combining DAMSM loss with contrastive loss, both for pretraining the text encoder and for training DM-GAN, and reported state-of-the-art (SOTA) results.
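For orientation, the DAMSM objective can be summarized as follows. This is a sketch based on the formulation in the AttnGAN paper, not taken from this repository's code: $R(Q, D)$ denotes the matching score between an image $Q$ and a description $D$ (word-level or sentence-level), $M$ is the batch size, and $\gamma_3$ is a smoothing factor. The exact form and weighting used here are assumptions.

```latex
% Posterior probability that description D_i matches image Q_i within a batch of M pairs
P(D_i \mid Q_i) = \frac{\exp\left(\gamma_3 R(Q_i, D_i)\right)}{\sum_{j=1}^{M} \exp\left(\gamma_3 R(Q_i, D_j)\right)}

% Image-to-text and text-to-image losses (word level; the sentence level is analogous)
\mathcal{L}_1^{w} = -\sum_{i=1}^{M} \log P(D_i \mid Q_i), \qquad
\mathcal{L}_2^{w} = -\sum_{i=1}^{M} \log P(Q_i \mid D_i)

% Total DAMSM loss
\mathcal{L}_{\mathrm{DAMSM}} = \mathcal{L}_1^{w} + \mathcal{L}_2^{w} + \mathcal{L}_1^{s} + \mathcal{L}_2^{s}
```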

In this work, we replace the RNN-based text encoder and the CNN-based image encoder with CLIP, a pretrained multimodal vision-language model based on the transformer architecture.


CLIP is a multimodal encoder for images and natural language, pretrained with a contrastive loss and a very large batch size (32,768).
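To make the contrastive objective concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is illustrative only: the function names are hypothetical, the temperature of 0.07 is an assumed value, and real training would use batched tensors in PyTorch rather than nested lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair k (image_feats[k], text_feats[k]) is the positive; every other
    combination in the batch acts as a negative. The temperature value is
    an assumption made for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by 1/temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch of two pairs: correctly matched versus deliberately swapped captions.
matched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

The loss is close to zero when every image is most similar to its own caption and grows when the pairs are mismatched, which is what pulls matching images and text together in the shared space.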

See the CLIP paper and the official PyTorch implementation of CLIP.

3. Prepared Data

Download the preprocessed datasets from AttnGAN

Alternatively, the preprocessed datasets are also available from DM-GAN.

4. Trained model

5. Training

  1. Fine-tuning the pretrained CLIP encoder
  • With CUBS2011 using DAMSM + Contrastive loss : $ python --cfg cfg/DAMSM/bird.yml --gpu 0

  • With COCO2014 using DAMSM + Contrastive loss : $ python --cfg cfg/DAMSM/coco.yml --gpu 0

  2. Training DM-GAN
  • With CUBS2011 : $ python --cfg cfg/clip_bird_DMGAN.yml --gpu 0

  • With COCO2014 : $ python --cfg cfg/clip_coco_DMGAN.yml --gpu 0

6. Evaluation

  1. Generate fake images and compute R-precision
  • CUBS2011 : $ python --cfg cfg/eval_clip_bird.yml

  • COCO2014 : $ python --cfg cfg/eval_clip_coco.yml

  2. Compute FID (Fréchet Inception Distance)
  • CUBS2011 : $ python --data bird --dims 2048 --batch_size 32

  • COCO2014 : $ python --data coco --dims 2048 --batch_size 32

  3. Compute Inception Score
  • CUBS2011 : $ python --data bird

  • COCO2014 : $ python --data coco
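As a rough illustration of the R-precision metric listed above, the sketch below ranks candidate captions for each generated image by cosine similarity and counts how often the ground-truth caption comes first (R = 1). The function names and the toy features are hypothetical. The usual protocol from the AttnGAN paper uses one ground-truth caption plus 99 randomly chosen mismatched ones per image; whether the evaluation script here follows it exactly is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def r_precision(image_feats, candidate_text_feats, true_indices):
    """Fraction of images whose top-ranked candidate caption is the true one.

    image_feats[i]          : feature vector of generated image i
    candidate_text_feats[i] : list of caption feature vectors for image i
    true_indices[i]         : position of the ground-truth caption in that list
    """
    hits = 0
    for img, candidates, true_idx in zip(image_feats, candidate_text_feats, true_indices):
        sims = [cosine(img, cand) for cand in candidates]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == true_idx)
    return hits / len(image_feats)


# Toy example: two images with three candidate captions each (made-up features).
images = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]],
]
score = r_precision(images, candidates, true_indices=[0, 1])
print(score)
```

In an actual evaluation the features would come from the fine-tuned image and text encoders, and the score would be averaged over all generated images.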

7. Citation