iSogCLR PyTorch Implementation

In this repo, we show how to train a self-supervised vision-language model with a Global Contrastive Loss (GCL) on CC3M, a widely used bimodal image-text dataset.

Getting Started

Try in Colab: https://colab.research.google.com/drive/1FTF-cTcW11Gyrwu8uhTZOXgLsjp49Z9W?usp=sharing

Environment

Create and activate a new virtual environment with Conda, then install the dependencies:

env_name='csce689_proj'
conda create -n "$env_name" python=3.10
conda activate "$env_name"
pip install -r requirements.txt
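
As an optional sanity check that the environment is ready (this assumes requirements.txt installs PyTorch, which the training script requires):

python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"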

Training and Evaluation

  1. Download the data: cc3m_subset_100k.tar.gz, a 100k-image subset of the Conceptual Captions 3M (CC3M) dataset; mscoco_val.tar.gz, a 5k-image subset of the MS-COCO val2014 split; and clip_train.tar.gz, the captions for both subsets. The code and data should be structured as follows (a sketch for unpacking the archives appears after the tree):
    .
    +--bimodal_exps (code)
    |
    +--clip_train (captions)
    |  +--cc3m_train_subset.json
    |  +--coco_val.json
    |
    +--datasets (images)
    |  +--cc3m_subset_100k
    |  +--mscoco_val
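
     A minimal sketch for unpacking the archives into this layout, assuming all three tarballs were downloaded to the repository root (adjust the -C targets if your archives are packed differently):

    mkdir -p datasets
    tar -xzf cc3m_subset_100k.tar.gz -C datasets
    tar -xzf mscoco_val.tar.gz -C datasets
    tar -xzf clip_train.tar.gz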
    
  2. To train a model on CC3M, use run.slurm if Slurm is available, or run:
    export PYTHONPATH="$PYTHONPATH:./bimodal_exps"
    export HUGGINGFACE_HUB_CACHE='./checkpoints/huggingface'  # cache for pretrained Hugging Face weights
    
    data_path=./datasets                  # root directory of the image folders
    ann_path=./clip_train                 # directory containing the caption JSON files
    train_image_root=cc3m_subset_100k/
    data=cc3m
    train_file=${data}_train_subset.json
    gamma=0.8                             # moving-average parameter of the SogCLR estimator
    epochs=30
    
    CUDA_VISIBLE_DEVICES=0 python ./bimodal_exps/clip.py \
        --data_path ${data_path} \
        --ann_path ${ann_path} \
        --train_file ${train_file} \
        --train_image_root ${train_image_root} \
        --output_dir output/isogclr_${data}_g${gamma}_e${epochs} \
        --init_model \
        --use_amp \
        --ita_type sogclr \
        --tau_init 0.01 \
        --sogclr_gamma ${gamma} \
        --eta_init 0.03 --sched cosine \
        --no-distributed \
        --epochs ${epochs}
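
     Checkpoints and logs are written to the directory given by --output_dir; step 3 below reads the final checkpoint from there. On a Slurm cluster, the same job can be submitted non-interactively (assuming run.slurm wraps the command above):

    sbatch run.slurm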
  3. To evaluate a trained model on MS-COCO, use eval.slurm if Slurm is available, or run:
    export PYTHONPATH="$PYTHONPATH:./bimodal_exps"
    export HUGGINGFACE_HUB_CACHE='./checkpoints/huggingface'
    
    data_path=./datasets
    ann_path=./clip_train
    train_image_root=cc3m_subset_100k/
    data=cc3m
    train_file=${data}_train_subset.json
    gamma=0.8
    epochs=30
    
    CUDA_VISIBLE_DEVICES=0 python ./bimodal_exps/clip.py \
        --data_path ${data_path} \
        --ann_path ${ann_path} \
        --train_file ${train_file} \
        --train_image_root ${train_image_root} \
        --output_dir output/isogclr_${data}_g${gamma}_e${epochs} \
        --init_model \
        --use_amp \
        --ita_type sogclr \
        --tau_init 0.01 \
        --sogclr_gamma ${gamma} \
        --eta_init 0.03 --sched cosine \
        --no-distributed \
        --epochs ${epochs} \
        --evaluate --checkpoint './output/isogclr_cc3m_g0.8_e30/checkpoint_30.pth'
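
     As in step 2, Slurm users can submit the evaluation as a batch job (assuming eval.slurm wraps the command above):

    sbatch eval.slurm

     Note that --checkpoint must point to a checkpoint produced by step 2; with the settings above, the final checkpoint appears to be saved as checkpoint_30.pth (i.e., checkpoint_<epochs>.pth) under the training output directory.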

Reference

If you find this tutorial helpful, please cite:

@inproceedings{qiu2023not,
  title={Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization},
  author={Qiu, Zi-Hao and Hu, Quanqi and Yuan, Zhuoning and Zhou, Denny and Zhang, Lijun and Yang, Tianbao},
  booktitle={International Conference on Machine Learning},
  pages={TBD},
  year={2023},
  organization={PMLR}
}