CCMB and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

🔥🔥🔥 CCMB: A Large-scale Chinese Cross-modal Benchmark (ACM MM 2023)

This repo is the official implementation of CCMB and R2D2.

CCMB is available. It includes a pre-training dataset (Zero) and five downstream datasets. A detailed introduction and the download URL are at http://zero.so.com. The 250M data is available at https://pan.baidu.com/s/1gnNbjOdCQdqZ4bRNN1S-Vw?pwd=iau8.

R2D2 is a vision-language framework. We release the following code and models:

✅ Pre-trained checkpoints.

✅ Inference demo.

✅ Fine-tuning code and checkpoints for Image-Text Retrieval and Image-Text Matching tasks.

Performance

We show the performance of R2D2 ViT-L fine-tuned on the Flickr30k-CNA dataset. For each image-text pair, R2D2 outputs a similarity score between 0 and 1.

| 中文 (English) | 乔丹投篮 (Jordan shot) | 乔丹运球 (Jordan dribble) | 詹姆斯投篮 (James shot) |
| --- | --- | --- | --- |
| Similarity score | 0.99033021 | 0.91078649 | 0.61231128 |
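
Because every score lies between 0 and 1, candidate captions for one image can be ranked directly by score. The following is a minimal sketch using the three values reported above. The `rank_captions` helper and the 0.9 threshold are illustrative assumptions and are not part of the R2D2 code base.

```python
from typing import Dict, List, Tuple


def rank_captions(
    scores: Dict[str, float], threshold: float = 0.9
) -> List[Tuple[str, float, bool]]:
    """Sort captions by similarity score, highest first.

    Each result also carries a flag telling whether the score reaches
    `threshold`, a hypothetical cut-off chosen only for illustration.
    """
    for caption, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {caption!r} is outside [0, 1]: {score}")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(caption, score, score >= threshold) for caption, score in ranked]


# Scores copied from the performance example (English glosses of the captions).
example_scores = {
    "James shot": 0.61231128,
    "Jordan dribble": 0.91078649,
    "Jordan shot": 0.99033021,
}

for caption, score, is_match in rank_captions(example_scores):
    label = "match" if is_match else "no match"
    print(f"{caption:<15} {score:.4f}  {label}")
```

With these values, the caption describing the correct player and action ranks first, and the caption naming a different player falls below the assumed threshold.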

Requirements

pip install -r requirements.txt

Pre-trained checkpoints

| Pre-training image-text pairs | R2D2 ViT-L | PRD2 ViT-L |
| --- | --- | --- |
| 250M | Download | Download |
| 23M | Download | - |

Fine-tuned checkpoints

| Dataset | R2D2 ViT-B (23M) |
| --- | --- |
| Flickr30k-CNA | Download |
| IQR | Download |
| ICR | Download |
| IQM | Download |
| ICM | Download |

Inference demo

  • To evaluate the pretrained R2D2 model on image-text pairs, run:
    python r2d2_inference_demo.py
  • To evaluate the pretrained PRD2 model on image-text pairs, run:
    python prd2_inference_demo.py

Downstream Tasks

  1. Download the datasets and pre-trained models. For the ICR, IQR, ICM, and IQM tasks, you should see the following folder structure after downloading:
    ├── IQR_IQM_ICR_ICM_images
    │
    ├── IQR
    │   ├── train
    │   └── val
    ├── ICR
    │   ├── train
    │   └── val
    ├── IQM
    │   ├── train
    │   └── val
    └── ICM
        ├── train
        └── val
    For Flickr30k-CNA, you should see the following folder structure after downloading:

    ├── Flickr30k-images
    ├── train
    ├── val
    └── test
  2. In config/retrieval_*.yaml, set the paths to the datasets and the pre-trained model.
  3. Run fine-tuning for the Image-Text Retrieval task.
    sh train_r2d2_retrieval.sh
    
  4. Run fine-tuning for the Image-Text Matching task.
    sh train_r2d2_matching.sh
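
Before launching either script, it can help to confirm that the downloaded data matches the folder structure from step 1. The sketch below is not part of the repository. The `missing_entries` helper and the `./data` root are hypothetical, and the check only tests that each entry exists, since the listing does not say whether `train`, `val`, and `test` are folders or files.

```python
from pathlib import Path
from typing import List

# Expected entries, taken from the folder structures listed in step 1.
RETRIEVAL_MATCHING_LAYOUT = [
    "IQR_IQM_ICR_ICM_images",
    "IQR/train", "IQR/val",
    "ICR/train", "ICR/val",
    "IQM/train", "IQM/val",
    "ICM/train", "ICM/val",
]
FLICKR30K_CNA_LAYOUT = ["Flickr30k-images", "train", "val", "test"]


def missing_entries(root: str, expected: List[str]) -> List[str]:
    """Return the expected relative paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]


if __name__ == "__main__":
    # "./data" is a placeholder; use the dataset path set in the config file.
    missing = missing_entries("./data", RETRIEVAL_MATCHING_LAYOUT)
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Passing `FLICKR30K_CNA_LAYOUT` instead checks the Flickr30k-CNA download in the same way.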
    

Citation

If you find this dataset and code useful for your research, please consider citing.

@inproceedings{xie2023ccmb,
  title={CCMB: A Large-scale Chinese Cross-modal Benchmark},
  author={Xie, Chunyu and Cai, Heng and Li, Jincheng and Kong, Fanjing and Wu, Xiaoyu and Song, Jianfei and Morimitsu, Henrique and Yao, Lin and Wang, Dexin and Zhang, Xiangzheng and others},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={4219--4227},
  year={2023}
}