/ALADIN

Official implementation of the paper "ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval"

Primary LanguagePythonMIT LicenseMIT

ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Introduction

This is the code for reproducing the results from our paper ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, accepted at CBMI 2022.

Our code is based on OSCAR, whose repository is available here.

Architecture

Installation

Requirements

  • Python 3.7
  • Pytorch 1.2
  • torchvision 0.4.0
  • cuda 10.0

Setup with Conda

# create a new environment
conda create --name oscar python=3.7
conda activate oscar

# install pytorch1.2
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch

export INSTALL_DIR=$PWD

# install apex
cd $INSTALL_DIR
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
python setup.py install --cuda_ext --cpp_ext

# install this repo
cd $INSTALL_DIR
git clone --recursive https://github.com/mesnico/OSCAR-TERAN-distillation
cd OSCAR-TERAN-distillation/coco_caption
./get_stanford_models.sh
cd ..
python setup.py build develop

# install requirements
pip install -r requirements.txt

unset INSTALL_DIR

Download OSCAR & Vin-VL Retrieval data:

Download the checkpoint folder with azcopy:

path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/checkpoint-0132780/' <checkpoint-target-folder> --recursive

Download the IR data

path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/coco_ir' <data-folder> --recursive

Download the pre-extracted Bottom-Up features

path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/' <features-folder> --recursive

Training

cd alad 
python train.py --data_dir <data-folder>/coco_ir --img_feat_file <features-folder>/features.tsv --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 --config configs/<config>.yaml --logger_name <output-folder> --val_step 7000 --max_seq_length 50 --max_img_seq_length 34

Configurations

The parameter --config is very important. Configurations are placed in yaml format inside the configs folder:

  • alad-alignment-triplet.yaml: Trains the alignment head using hinge-based triplet ranking loss, finetuning also the Vin-VL backbone;
  • alad-matching-triplet-finetune.yaml: Trains only the matching head using hinge-based triplet ranking loss. The parameter --load-teacher-model can be used to provide a backbone previously trained using the alad-alignment-triplet.yaml configuration;
  • alad-matching-distill-finetune.yaml: Trains only the matching head by distilling the scores from the alignment head. The parameter --load-teacher-model in this case IS NEEDED to provide a correctly trained alignment head, previously trained using the alad-alignment-triplet.yaml configuration;
  • alad-matching-triplet-e2e.yaml: Trains the matching head, finetuning also the Vin-VL backbone;
  • alad-alignment-and-matching-distill.yaml: Trains the whole architecture (matching+alignment heads) end-to-end. The variable activate_distillation_after inside the configuration file controls how many epochs to wait before activating the distillation loss (wait that the backbone is minimally stable); alternatively, you can load a pre-trained backbone using the --load-teacher-model option.

Monitor Training

Training and validation metrics, as well as model checkpoints are put inside the <output-folder> path. You can live monitor all the metrics using tensorboard:

tensorboard --logdir <output-folder>

Testing

The following script tests a model on the 1k MS-COCO test set (you can download our best model from here; it is obtained with the alad-alignment-and-matching-distill.yaml configuration.)

cd alad
python test.py --data_dir <data-folder>/coco_ir --img_feat_file <features-folder>/features.tsv --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 --max_seq_length 50 --max_img_seq_length 34 --eval_img_keys_file test_img_keys_1k.tsv --load_checkpoint <path/to/checkpoint.pth.tar>

To test on 5k test set, simply set --eval_img_keys_file test_img_keys.tsv.

Reference

If you found this code useful, please cite the following paper:

@inproceedings{messina2022aladin,
  title={ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval},
  author={Messina, Nicola and Stefanini, Matteo and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Amato, Giuseppe and Cucchiara, Rita},
  booktitle={International Conference on Content-based Multimedia Indexing},
  pages={64--70},
  year={2022}
}