ALIP

[ICCV 2023] ALIP: Adaptive Language-Image Pre-training with Synthetic Caption


Authors: Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, Tongliang Liu


Introduction

Adaptive Language-Image Pre-training (ALIP) is a bi-path model that integrates supervision from both raw text and synthetic captions. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during training. Meanwhile, the adaptive contrastive loss effectively reduces the impact of noisy data and improves the efficiency of the pre-training data.
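To give a feel for the re-weighting idea, here is a minimal, hypothetical sketch (not the released LCG/DCG implementation): per-sample weights derived from the agreement between raw-text and synthetic-caption embeddings re-weight two InfoNCE-style losses. All function names and the specific gate are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_clip_loss(image_emb, text_emb, sample_weights, temperature=0.07):
    """InfoNCE-style contrastive loss where each image-text pair is re-weighted.

    image_emb, text_emb: (N, D) L2-normalized embeddings.
    sample_weights: (N,) weights in [0, 1], e.g. produced by a consistency gate.
    """
    logits = image_emb @ text_emb.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    per_pair = 0.5 * (loss_i2t + loss_t2i)
    return (sample_weights * per_pair).sum() / sample_weights.sum().clamp(min=1e-6)

def consistency_weights(text_emb, caption_emb):
    """Toy gate: weight each sample by the cosine similarity between its raw-text
    and synthetic-caption embeddings (higher agreement -> higher weight)."""
    sim = (text_emb * caption_emb).sum(dim=-1)                 # both inputs assumed L2-normalized
    return sim.clamp(min=0.0)

# Sketch of usage with normalized embeddings from an image/text encoder:
# w = consistency_weights(text_emb, caption_emb)
# loss = weighted_clip_loss(img_emb, text_emb, w) + weighted_clip_loss(img_emb, caption_emb, w)
```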

📣 News

  • (2023.8.15): ✨Code has been released❗️
  • (2023.7.14): ✨Our paper is accepted to ICCV2023❗️

Instructions

  • Environment installation

    pip install -r requirments.txt
    
  • Dataset preparation

    1. Download YFCC15M

    The YFCC15M dataset we use is YFCC15M-DeCLIP. We downloaded it from that repo and obtained 15,061,515 image-text pairs in total.

    2. Generate synthetic captions

    In our paper, we use the OFA model to generate synthetic captions. You can download the model weights and scripts from the OFA project.

    3. Generate rec files

    To improve training efficiency, we use MXNet to pack the YFCC15M dataset into RecordIO (.rec) files and use NVIDIA DALI to accelerate data loading and pre-processing. Sample code for generating the rec files is provided in data2rec.py; a rough illustration is sketched below.
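For reference, the sketch below shows one way to pack image/text/caption triplets into an indexed MXNet RecordIO file. The input list format and the pickled payload layout are assumptions for illustration; data2rec.py in this repo is the authoritative conversion script.

```python
import pickle
import mxnet as mx

def pack_to_rec(samples, rec_prefix):
    """samples: iterable of (image_path, raw_text, synthetic_caption) triplets."""
    record = mx.recordio.MXIndexedRecordIO(rec_prefix + ".idx", rec_prefix + ".rec", "w")
    for idx, (image_path, raw_text, caption) in enumerate(samples):
        with open(image_path, "rb") as f:
            img_bytes = f.read()                      # keep the encoded JPEG bytes as-is
        header = mx.recordio.IRHeader(flag=0, label=0, id=idx, id2=0)
        payload = pickle.dumps({"image": img_bytes, "text": raw_text, "caption": caption})
        record.write_idx(idx, mx.recordio.pack(header, payload))
    record.close()
```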

  • Pretrained Model Weight

    You can download the pretrained model weights from Google Drive or BaiduYun, and the training log is also available on Google Drive or BaiduYun.

  • Training

    Start training by running:

    bash scripts/train_yfcc15m_B32_ALIP.sh
    
  • Evaluation

    Evaluate zero-shot cross-modal retrieval:

    bash run_retrieval.sh
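
    For reference, retrieval R@K is computed from the image-text similarity matrix. The sketch below assumes pre-computed, L2-normalized embeddings and a 1:1 image-caption pairing; it is an illustration of the standard protocol, not the repo's run_retrieval.sh.

```python
import torch

def recall_at_k(image_emb, text_emb, ks=(1, 5, 10)):
    """Image-to-text and text-to-image recall@K from L2-normalized embeddings.
    Assumes text_emb[i] is the caption paired with image_emb[i]
    (the actual Flickr30k/MSCOCO test sets have 5 captions per image)."""
    sims = image_emb @ text_emb.t()                      # (N, N) cosine similarities
    gt = torch.arange(sims.size(0), device=sims.device)
    # rank of the ground-truth item for each query (0 = retrieved first)
    i2t_rank = (sims.argsort(dim=1, descending=True) == gt[:, None]).float().argmax(dim=1)
    t2i_rank = (sims.t().argsort(dim=1, descending=True) == gt[:, None]).float().argmax(dim=1)
    metrics = {}
    for k in ks:
        metrics[f"i2t_R@{k}"] = (i2t_rank < k).float().mean().item()
        metrics[f"t2i_R@{k}"] = (t2i_rank < k).float().mean().item()
    return metrics
```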
    

    Evaluate zero-shot classification:

    bash run_zeroshot.sh
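
    Zero-shot classification follows the usual CLIP-style protocol: encode prompted class names, then assign each image to the most similar prompt. A minimal sketch with open_clip is shown below; the checkpoint path, class names, and single prompt template are placeholders, and whether the released ALIP weights load directly through the pretrained argument is an assumption.

```python
import torch
import open_clip

# Placeholder checkpoint path; point this at the released ALIP ViT-B/32 weights.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="/path/to/alip_b32.pt")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classnames = ["airplane", "automobile", "bird"]                  # example classes only
prompts = tokenizer([f"a photo of a {c}." for c in classnames])  # single template for brevity

with torch.no_grad():
    text_feat = model.encode_text(prompts)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

def classify(pil_image):
    """Return the predicted class index for one PIL image."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(pil_image).unsqueeze(0))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ text_feat.t()).argmax(dim=-1).item()
```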
    
  • Results

    Zero-shot cross-modal retrieval (each cell reports image-to-text / text-to-image recall)

    | Method   | Model | MSCOCO R@1 | MSCOCO R@5 | MSCOCO R@10 | Flickr30k R@1 | Flickr30k R@5 | Flickr30k R@10 |
    |----------|-------|------------|------------|-------------|---------------|---------------|----------------|
    | CLIP     | B/32  | 20.8/13.0  | 43.9/31.7  | 55.7/42.7   | 34.9/23.4     | 63.9/47.2     | 75.9/58.9      |
    | SLIP     | B/32  | 27.7/18.2  | 52.6/39.2  | 63.9/51.0   | 47.8/32.3     | 76.5/58.7     | 85.9/68.8      |
    | DeCLIP   | B/32  | 28.3/18.4  | 53.2/39.6  | 64.5/51.4   | 51.4/34.3     | 80.2/60.3     | 88.9/70.7      |
    | UniCLIP  | B/32  | 32.0/20.2  | 57.7/43.2  | 69.2/54.4   | 52.3/34.8     | 81.6/62.0     | 89.0/72.0      |
    | HiCLIP   | B/32  | 34.2/20.6  | 60.3/43.8  | 70.9/55.3   | —             | —             | —              |
    | HiDeCLIP | B/32  | 38.7/23.9  | 64.4/48.2  | 74.8/60.1   | —             | —             | —              |
    | ALIP     | B/32  | 46.8/29.3  | 72.4/54.4  | 81.8/65.4   | 70.5/48.9     | 91.9/75.1     | 95.7/82.9      |

    Zero-shot classification (top-1 accuracy, %)

    | Method   | Model | CIFAR10 | CIFAR100 | Food101 | Pets | Flowers | SUN397 | Cars | DTD  | Caltech101 | Aircraft | ImageNet | Average |
    |----------|-------|---------|----------|---------|------|---------|--------|------|------|------------|----------|----------|---------|
    | CLIP     | B/32  | 63.7    | 33.2     | 34.6    | 20.1 | 50.1    | 35.7   | 2.6  | 15.5 | 59.9       | 1.2      | 32.8     | 31.8    |
    | SLIP     | B/32  | 50.7    | 25.5     | 33.3    | 23.5 | 49.0    | 34.7   | 2.8  | 14.4 | 59.9       | 1.7      | 34.3     | 30.0    |
    | FILIP    | B/32  | 65.5    | 33.5     | 43.1    | 24.1 | 52.7    | 50.7   | 3.3  | 24.3 | 68.8       | 3.2      | 39.5     | 37.2    |
    | DeCLIP   | B/32  | 66.7    | 38.7     | 52.5    | 33.8 | 60.8    | 50.3   | 3.8  | 27.7 | 74.7       | 2.1      | 43.2     | 41.3    |
    | DeFILIP  | B/32  | 70.1    | 46.8     | 54.5    | 40.3 | 63.7    | 52.4   | 4.6  | 30.2 | 75.0       | 3.3      | 45.0     | 44.2    |
    | HiCLIP   | B/32  | 74.1    | 46.0     | 51.2    | 37.8 | 60.9    | 50.6   | 4.5  | 23.1 | 67.4       | 3.6      | 40.5     | 41.8    |
    | HiDeCLIP | B/32  | 65.1    | 39.4     | 56.3    | 43.6 | 64.1    | 55.4   | 5.4  | 34.0 | 77.0       | 4.6      | 45.9     | 44.6    |
    | ALIP     | B/32  | 83.8    | 51.9     | 45.4    | 30.7 | 54.8    | 47.8   | 3.4  | 23.2 | 74.1       | 2.7      | 40.3     | 41.7    |

Acknowledgement

This project is built on open_clip and OFA; thanks for their great work.

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{yang2023alip,
      title={ALIP: Adaptive Language-Image Pre-training with Synthetic Caption}, 
      author={Kaicheng Yang and Jiankang Deng and Xiang An and Jiawei Li and Ziyong Feng and Jia Guo and Jing Yang and Tongliang Liu},
      year={2023},
      eprint={2308.08428},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🌟Star History

Star History Chart