Diffusion Feedback Helps CLIP See Better

Wenxuan Wang1,2,3*, Quan Sun3*, Fan Zhang3, Yepeng Tang4, Jing Liu1,2, Xinlong Wang3

1CASIA, 2UCAS, 3BAAI, 4BJTU
* Equal Contribution

⏰ Schedule

[2024-08-07] We release CLIP model weights! 💥

[2024-08-05] We release training & evaluation code! 💥

[2024-07-30] Our paper is released on arXiv! 💥

💡 Motivation

In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., by 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.

🤖 Architecture

Given an image, the CLIP model encodes its visual features, which form the main part of the condition; the generative diffusion model then predicts the added noise, taking the noisy image and the condition as input. We optimize CLIP's representation by maximizing the image likelihood with the diffusion loss via generative feedback.
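The objective described above can be written in standard denoising-diffusion notation. This is a sketch only: the symbols are illustrative assumptions, not notation taken from the repository. Here $x_0$ is the input image, $\epsilon$ the added Gaussian noise, $x_t$ the noisy image at timestep $t$, $\bar{\alpha}_t$ the cumulative noise schedule, $\epsilon_\phi$ the diffusion model's noise predictor, and $c_\theta(x_0)$ the condition built mainly from the visual features of a CLIP encoder with parameters $\theta$.

```latex
\begin{aligned}
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\, \epsilon}
  \left[ \left\lVert \epsilon - \epsilon_\phi\!\left(x_t,\, t,\, c_\theta(x_0)\right) \right\rVert_2^2 \right].
\end{aligned}
```

In this form, gradients of the noise-prediction error flow back through the condition $c_\theta(x_0)$ into the CLIP encoder, which is one way to read "generative feedback". Lowering this loss corresponds, in the usual diffusion formulation, to raising a variational lower bound on the image likelihood.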

🔨 Installation

Clone this repository and install the required packages:

git clone https://github.com/baaivision/DIVA.git
cd DIVA
mkdir -p outputs logs datasets pretrained_weights/CLIP pretrained_weights/SD

conda create -n diva python=3.9
conda activate diva
pip install -r requirements.txt

Core packages:

🍹 Preparation for DIVA's Generative Fine-tuning

Data Acquisition

For data preparation, please refer to image2dataset and MMVP for the training and evaluation data used in this work. After collecting the datasets, place them directly in the dataset/ folder.

Pre-trained Weight Downloading

For pre-trained weights, please refer to OpenAI ViT-L-14/224&336, MetaCLIP ViT-L/H-14, SigLIP ViT-SO-14/224, SigLIP ViT-SO-14/384, DFN ViT-H-14/224, DFN ViT-H-14/378 and SD-2-1-base to obtain the weights for the discriminative CLIP models and for the diffusion model that provides generative feedback. After downloading, move the CLIP weights to pretrained_weights/CLIP/ and the diffusion model weights to pretrained_weights/SD/.
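Before launching training, it can help to confirm that the folders are in place. The following is a minimal sketch, not part of the repository: it assumes the folder names from the mkdir command in the installation step, and the helper name check_layout is hypothetical. It only checks that the folders exist and that the two weight folders are not empty; it does not validate the weight files themselves.

```python
from pathlib import Path

# Folder names taken from the mkdir command in the installation step.
EXPECTED_DIRS = (
    "outputs",
    "logs",
    "datasets",
    "pretrained_weights/CLIP",
    "pretrained_weights/SD",
)
WEIGHT_DIRS = ("pretrained_weights/CLIP", "pretrained_weights/SD")


def check_layout(root="."):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    root = Path(root)
    problems = []
    for rel in EXPECTED_DIRS:
        if not (root / rel).is_dir():
            problems.append(f"missing folder: {rel}")
    for rel in WEIGHT_DIRS:
        folder = root / rel
        # An existing but empty weight folder means nothing was downloaded yet.
        if folder.is_dir() and not any(folder.iterdir()):
            problems.append(f"no weights found in: {rel}")
    return problems
```

Calling check_layout() from the repository root and printing the returned list gives a quick summary of what is still missing.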

Code Modification

To support DIVA's condition design, some source code in the installed CLIP, OpenCLIP and timm packages needs to be modified.

For OpenAI CLIP, use the content in our provided condition/OpenAICLIP_for_clip_model.py to replace the content in Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/clip/model.py.

For MetaCLIP and DFN, use the content in our provided condition/MetaCLIP_for_openclip_transformer.py and condition/DFN_for_openclip_transformer.py to replace the content in Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/open_clip/transformer.py, respectively.

For SigLIP, use the content in our provided condition/SigLIP_for_timm_models_visiontransformer.py to replace the content in Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/timm/models/vision_transformer.py.
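The replacements above can also be scripted so that the installation path does not have to be typed by hand. The sketch below is an assumption-laden illustration, not part of the repository: the helper names locate_module_file and replace_module_source are hypothetical, and the module names (clip.model, open_clip.transformer, timm.models.vision_transformer) are inferred from the file paths listed above. It asks Python where the module is installed in the active environment, keeps a one-time backup of the original file, and then copies the provided file over it.

```python
import importlib.util
import shutil
from pathlib import Path


def locate_module_file(module_name):
    """Return the source file of an installed module in the active environment.

    Note: resolving a dotted name such as "clip.model" imports the parent
    package ("clip") as a side effect.
    """
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{module_name} is not installed in this environment")
    return Path(spec.origin)


def replace_module_source(module_name, replacement, backup_suffix=".orig"):
    """Overwrite an installed module file with `replacement`, keeping a backup."""
    target = locate_module_file(module_name)
    backup = target.with_name(target.name + backup_suffix)
    if not backup.exists():
        # Back up only once so repeated runs never overwrite the pristine copy.
        shutil.copy2(target, backup)
    shutil.copy2(replacement, target)
    return target


# Example calls, run from the repository root with the diva environment active:
# replace_module_source("clip.model", "condition/OpenAICLIP_for_clip_model.py")
# replace_module_source("open_clip.transformer", "condition/MetaCLIP_for_openclip_transformer.py")
# replace_module_source("timm.models.vision_transformer",
#                       "condition/SigLIP_for_timm_models_visiontransformer.py")
```

Because MetaCLIP and DFN both replace the same open_clip/transformer.py, only one of the two can be active at a time; the backup file makes it possible to restore the original before switching.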

🍻 Quick Start for Training & Evaluation

After completing the preparation steps above, start DIVA training with the command for your chosen backbone:

# For OpenAICLIP
bash DIVA_for_OpenAICLIP.sh

# For MetaCLIP
bash DIVA_for_MetaCLIP.sh

# For SigLIP
bash DIVA_for_SigLIP.sh

# For DFN
bash DIVA_for_DFN.sh
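If you prefer to launch training from Python, for example to run several backbones in sequence, the commands above can be wrapped as follows. This is a small sketch under stated assumptions: the script names are the ones listed above, while the helper names build_command and launch are hypothetical and not part of the repository.

```python
import subprocess

# Backbone name -> training script, as listed in the commands above.
SCRIPTS = {
    "OpenAICLIP": "DIVA_for_OpenAICLIP.sh",
    "MetaCLIP": "DIVA_for_MetaCLIP.sh",
    "SigLIP": "DIVA_for_SigLIP.sh",
    "DFN": "DIVA_for_DFN.sh",
}


def build_command(backbone):
    """Return the argument list that starts training for the given backbone."""
    if backbone not in SCRIPTS:
        raise ValueError(f"unknown backbone {backbone!r}; expected one of {sorted(SCRIPTS)}")
    return ["bash", SCRIPTS[backbone]]


def launch(backbone):
    """Run the training script from the repository root."""
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    subprocess.run(build_command(backbone), check=True)
```

For instance, launch("SigLIP") is equivalent to running bash DIVA_for_SigLIP.sh in the shell.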

Model Zoo

Method             Image Size   Params (M)   Average Score
OpenAI ViT-L-14    224²         427.6        25.9 (+6.6)
OpenAI ViT-L-14    336²         427.9        25.2 (+5.2)
MetaCLIP ViT-L-14  224²         427.6        27.4 (+3.7)
MetaCLIP ViT-H-14  224²         986.1        31.9 (+6.7)
SigLIP ViT-SO-14   224²         877.4        40.7 (+2.9)
SigLIP ViT-SO-14   384²         878.0        38.5 (+1.5)
DFN ViT-H-14       224²         986.1        43.7 (+4.4)
DFN ViT-H-14       378²         986.7        37.8 (+3.0)

Note that, owing to randomness in the condition design during training and in the selection of local patch tokens during inference for OpenAI CLIP, the scores obtained on the MMVP-VLM benchmark with our provided OpenAI CLIP weights may differ from the results reported in our paper. If the scores fall short of expectations, we recommend trying several different random seeds.
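Trying several seeds can be organized as a small sweep. The sketch below is illustrative only: evaluate is a hypothetical placeholder for whatever function runs the MMVP-VLM evaluation for one seed and returns its score, and it is not part of the repository. Only Python's built-in random generator is seeded here; a real run would also need to seed NumPy and PyTorch inside evaluate.

```python
import random
from statistics import mean


def sweep_seeds(evaluate, seeds):
    """Call evaluate(seed) once per seed and return {seed: score}."""
    scores = {}
    for seed in seeds:
        # Seeds only Python's RNG; NumPy/PyTorch seeding is left to evaluate().
        random.seed(seed)
        scores[seed] = evaluate(seed)
    return scores


def summarize(scores):
    """Report the best seed, its score, and the mean score over all seeds."""
    best_seed = max(scores, key=scores.get)
    return {
        "best_seed": best_seed,
        "best": scores[best_seed],
        "mean": mean(scores.values()),
    }
```

Reporting the mean alongside the best score makes it easier to tell a genuinely good checkpoint from a lucky seed.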

🎨 Visualization

[Figure: scene]

💙 Acknowledgement

DIVA is built upon the awesome Diffusion-TTA, MMVP, CLIP, OpenCLIP, and timm.

📝 Citation

@article{wang2024diffusion,
      title={Diffusion Feedback Helps CLIP See Better},
      author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
      journal={arXiv preprint arXiv:2407.20171},
      year={2024}
}