
PPE ✨

PyTorch implementation of our CVPR'2022 paper:

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model. Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu Timofte, Luc Van Gool, Errui Ding. To appear in CVPR 2022.

This code is reimplemented based on orpatashnik/StyleCLIP. We thank the authors for open-sourcing their code.

We also have a PaddlePaddle implementation here.

Updates

24 Mar 2022: Updated the arXiv version of our paper.

26 Mar 2022: Released code for reproducing the experiments in the paper.

30 Mar 2022: Created this repository for the PyTorch implementation.

To be continued...

To reproduce our results:

Setup:

The setup is the same as for StyleCLIP:

  • Install CLIP:

    conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=<CUDA_VERSION>
    pip install ftfy regex tqdm gdown
    pip install git+https://github.com/openai/CLIP.git
  • Download pre-trained models:

    The code relies on the Rosinality PyTorch implementation of StyleGAN2. Download the pre-trained StyleGAN2 generator from here.

    Training also needs the weights of the facial recognition network used in the ID loss. Download the weights from here.

  • Invert real images:

    The mapper is trained on latent vectors, so real images must first be inverted into the latent space. For editing human faces, StyleCLIP provides the CelebA-HQ dataset inverted by e4e: train set, test set. A loading sketch follows this list.
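For orientation, here is a minimal loading sketch assuming the StyleCLIP-style file layout. The module path and the file names (stylegan2-ffhq-config-f.pt, train_faces.pt) are assumptions; adjust them to wherever you placed the downloads.

    # Minimal sketch: load the generator and the inverted latents.
    # File names and module path are assumptions based on the StyleCLIP setup.
    import torch
    from models.stylegan2.model import Generator  # Rosinality-based generator shipped with StyleCLIP

    # 1024x1024 FFHQ generator: style_dim=512, n_mlp=8
    g_ema = Generator(1024, 512, 8)
    ckpt = torch.load("stylegan2-ffhq-config-f.pt", map_location="cpu")
    g_ema.load_state_dict(ckpt["g_ema"], strict=False)
    g_ema.eval()

    # e4e-inverted CelebA-HQ latents: a tensor of W+ codes, shape (N, 18, 512)
    latents = torch.load("train_faces.pt", map_location="cpu")

    # Decode one latent back to an image (input_is_latent=True for W+ codes)
    with torch.no_grad():
        img, _ = g_ema([latents[:1]], input_is_latent=True, randomize_noise=False)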

Usage:

All commands below are run from the mapper directory, so first run:

cd mapper
mkdir preprocess

Predict

  • Aggregate the images that are most relevant to the text command (a conceptual sketch of this ranking follows this list):

    python scripts/randc.py --cmd "black hair"
  • Find the attributes that appear most frequently in the command-relevant images:

    python scripts/find_ancs.py --cmd "black hair"
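Conceptually, randc.py ranks generated images by their CLIP similarity to the command. The sketch below illustrates that ranking only; the file layout and helper names are hypothetical, not the script's actual logic.

    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    @torch.no_grad()
    def relevance(image_path, text_feat):
        """Cosine similarity between one image and the command in CLIP space."""
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ text_feat.T).item()

    # Encode the command once, then rank candidate images by relevance
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize(["black hair"]).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    paths = ["preprocess/sample_%04d.png" % i for i in range(100)]  # hypothetical layout
    top = sorted(paths, key=lambda p: relevance(p, text_feat), reverse=True)[:10]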

Prevent

Train the mapper network with the Entanglement Loss based on the found attributes (we refer to them as "anchors"):

python scripts/train.py --exp_dir ../results/black_hair_ppe --description "black hair" --anchors 'short eyebrows','with bangs','short hair','black eyes','narrow eyes','high cheekbones','with lipstick','pointy face','sideburns','with makeup' --tar_dist 0.1826171875
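For intuition only: the Entanglement Loss discourages the edit from drifting toward the anchor texts in CLIP space. Below is a minimal sketch of such a penalty, assuming L2-normalized CLIP features; it is illustrative, not the exact loss implemented in the mapper.

    import clip
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    anchors = ["short eyebrows", "with bangs", "short hair"]  # as found by find_ancs.py
    with torch.no_grad():
        anchor_feats = model.encode_text(clip.tokenize(anchors).to(device))
        anchor_feats = anchor_feats / anchor_feats.norm(dim=-1, keepdim=True)

    def entanglement_penalty(feat_orig, feat_edit):
        """Penalize changes in CLIP similarity toward the anchor texts.

        feat_orig, feat_edit: L2-normalized CLIP image features, shape (1, 512).
        A disentangled edit leaves the anchor similarities (nearly) unchanged.
        """
        sim_orig = feat_orig @ anchor_feats.T  # (1, n_anchors)
        sim_edit = feat_edit @ anchor_feats.T
        return (sim_edit - sim_orig).abs().mean()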

Evaluate

Evaluate the manipulation with our evaluation metric:

python scripts/evaluate.py --exp_dir ../results/black_hair_ppe --description "black hair" --anchors 'short eyebrows','with bangs','short hair','black eyes','narrow eyes','high cheekbones','with lipstick','pointy face','sideburns','with makeup' --tar_dist 0.1826171875
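Conceptually, the evaluation asks two things: does the command similarity reach the target level given by --tar_dist, and do the anchor similarities stay put? The snippet below is an illustrative report along those lines, not the formula implemented in evaluate.py.

    import clip
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    @torch.no_grad()
    def clip_sims(image, texts):
        """Similarities between one preprocessed image batch and a list of texts."""
        img_feat = model.encode_image(image.to(device))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = model.encode_text(clip.tokenize(texts).to(device))
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ txt_feat.T).squeeze(0)  # (len(texts),)

    cmd, anchors, tar_dist = "black hair", ["with bangs", "short hair"], 0.1826171875
    # Placeholders for a real original/edited pair, e.g. preprocess(Image.open(...))
    orig_img = torch.randn(1, 3, 224, 224)
    edit_img = torch.randn(1, 3, 224, 224)

    sims_orig = clip_sims(orig_img, [cmd] + anchors)
    sims_edit = clip_sims(edit_img, [cmd] + anchors)
    print("command sim: %.4f -> %.4f (target %.4f)" % (sims_orig[0], sims_edit[0], tar_dist))
    print("anchor drift: %.4f" % (sims_edit[1:] - sims_orig[1:]).abs().mean())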

Reference

@article{xu2022ppe,
  author  = {Zipeng Xu and Tianwei Lin and Hao Tang and Fu Li and Dongliang He and Nicu Sebe and Radu Timofte and Luc Van Gool and Errui Ding},
  title   = {Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model},
  journal = {arXiv preprint arXiv:2111.13333},
  year    = {2021}
}

Please contact zipeng.xu@unitn.it if you have any questions.