【ICCV'2023 🔥】DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

The implementation of the paper DiffusionRet: Generative Text-Video Retrieval with Diffusion Model.

In this paper, we propose a novel diffusion-based text-video retrieval framework, called DiffusionRet, which addresses the limitations of current discriminative solutions from a generative perspective.

📌 Citation

If you find this paper useful, please consider staring 🌟 this repo and citing 📑 our paper:

@inproceedings{jin2023diffusionret,
  title={DiffusionRet: Generative Text-Video Retrieval with Diffusion Model},
  author={Jin, Peng and Li, Hao and Cheng, Zesen and Li, Kehan and Ji, Xiangyang and Liu, Chang and Yuan, Li and Chen, Jie},
  booktitle={ICCV},
  pages={2470-2481},
  year={2023}
}

💡 I also have other text-video retrieval projects that may interest you ✨.

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
Accepted by CVPR 2023 (Highlight) | [HBI Code]
Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations
Accepted by NeurIPS 2022 | [EMCL Code]
Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, Jie Chen

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment
Accepted by IJCAI 2023 | [DiCoSA Code]
Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

📣 Updates

[2023/08/27]: We release the training code.
[2023/07/14]: Our DiffusionRet has been accepted by ICCV 2023! We will release the training code asap.
[2023/06/28]: We release the inference code.
[2023/03/31]: Our paper is under review. After our paper is published, we will release the code as soon as possible.

📕 Overview

Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise.

🚀 Quick Start

Setup

Setup code environment

conda create -n DiffusionRet python=3.9
conda activate DiffusionRet
pip install -r requirements.txt
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Download CLIP Model

cd DiffusionRet/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt

Download Datasets

Datasets	Google Cloud	Baidu Yun	Peking University Yun
MSR-VTT	Download	Download	Download
MSVD	Download	Download	Download
ActivityNet	TODO	Download	Download
DiDeMo	TODO	Download	Download

Model Zoo

Checkpoint	Google Cloud	Baidu Yun	Peking University Yun
MSR-VTT	Download	Download	Download
ActivityNet	Download	Download	Download

Evaluate

Eval on MSR-VTT

CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=1 \
eval.py \
--workers 8 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--diffusion_steps 50 \
--noise_schedule cosine \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}

Eval on ActivityNet Captions

CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
eval.py \
--workers 8 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--diffusion_steps 50 \
--noise_schedule cosine \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}

Train

Discrimination Pretrain

Train the feature extractor from the discrimination perspective.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--stage discrimination \
--output_dir ${OUTPUT_PATH}

Generation Finetune

Optimize the generator from the generation perspective.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 32 \
--max_frames 12 \
--video_framerate 1 \
--stage generation \
--diffusion_steps 50 \
--noise_schedule cosine \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}

🎗️ Acknowledgments

Our code is based on EMCL, CLIP, CLIP4Clip and DRL. We sincerely appreciate for their contributions.

jpthu17/DiffusionRet