Official PyTorch implementation of the paper "Improving Visual-Semantic Embeddings with Adaptive Context-aware Pooling and Adaptive Clustering Objective"
- `train.py`: trains the VSE+2AD model with various visual-semantic backbones on COCO and Flickr30K.
- `eval.py`: evaluates the pre-trained models on COCO and Flickr30K.
- `arguments.py`: defines the command-line arguments that control training.
- `modules/`: the building blocks of the VSE+2AD model: `vse.py`, `mlp.py`, `txt_enc.py`, `img_enc.py`, `resnet.py`, and `adcap.py`.
- `vocab.py`: builds or loads vocabularies.
- `logger.py`: creates a logger that records both training and evaluation information.
- `utils.py`: basic utility functions.
- `losses.py`: defines the loss modules, including the adaptive clustering objective (`adcto`).
The key dependencies on Ubuntu 20 for both training and inference are as follows:
- Python 3.8.1
- PyTorch 1.8.0
- Transformers 4.1.0
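A minimal way to set up a matching environment (the environment name and the use of conda/pip are assumptions, not part of this repo) could be:

```bash
# Hypothetical setup; package versions taken from the dependency list above
conda create -n vse2ad python=3.8.1
conda activate vse2ad
pip install torch==1.8.0 transformers==4.1.0
```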
The original and pre-computed data for COCO and Flickr30K, the pre-trained weights, and the vocabularies should be placed under the `data` directory, organized as follows:
```
data
├── f30k                          # Flickr30K dataset
│   ├── precomp                   # pre-computed BUTD region features for Flickr30K
│   │   ├── train_ids.txt
│   │   ├── train_caps.txt
│   │   └── ......
│   ├── images                    # original images
│   │   ├── xxx.jpg
│   │   └── ...
│   └── id_mapping.json
├── coco                          # MS-COCO dataset
│   ├── precomp                   # pre-computed BUTD region features for MS-COCO
│   │   ├── train_ids.txt
│   │   ├── train_caps.txt
│   │   └── ......
│   ├── images                    # original images
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   └── ......
│   │   └── val2014
│   └── id_mapping.json
├── vocab                         # the vocabulary files
│   ├── f30k_precomp_vocab.json
│   └── coco_precomp_vocab.json
└── weights
    └── original_updown_backbone.pth   # the BUTD CNN weights
```
The download links for the original COCO/F30K images, the pre-computed BUTD features, and the corresponding vocabularies can be found in the official repo of SCAN. `weights/original_updown_backbone.pth` holds the pre-trained ResNet-101 weights from the Bottom-up Attention model.
Run `train.py` to train a model on either COCO or Flickr30K. To train with `bigru` as the textual backbone and `project` as the visual backbone, where `project` refers to the pre-computed BUTD features followed by a simple projection, use the following command:
```bash
python train.py --log_name bigru_project_f30k \
                --data_name f30k/coco \
                --precomp_enc_type basic \
                --batch_size 128 \
                --txt_enc_type rnn \
                --use_bigru \
                --img_enc_type project \
                --loss_type acc
```
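For example, the same configuration trained on COCO changes only the dataset name (and, following the naming convention above, the log name):

```bash
python train.py --log_name bigru_project_coco \
                --data_name coco \
                --precomp_enc_type basic \
                --batch_size 128 \
                --txt_enc_type rnn \
                --use_bigru \
                --img_enc_type project \
                --loss_type acc
```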
To fine-tune `bigru` together with the `butd` visual backbone on Flickr30K or COCO, use the following command:
```bash
python train.py --log_name bigru_butd_f30k \
                --data_name f30k/coco \
                --precomp_enc_type backbone \
                --batch_size 128 \
                --txt_enc_type rnn \
                --use_bigru \
                --img_enc_type butd \
                --backbone_lr_factor 0.05 \
                --loss_type acc
```
To fine-tune `bigru` together with the `vit` visual backbone on Flickr30K or COCO, use the following command:
```bash
python train.py --log_name bigru_vit_f30k \
                --data_name f30k/coco \
                --precomp_enc_type backbone \
                --batch_size 128 \
                --txt_enc_type rnn \
                --use_bigru \
                --img_enc_type vit \
                --vit_type google/vit-base-patch16-224 \
                --backbone_lr_factor 0.05 \
                --loss_type acc
```
To fine-tune `bert` as the textual backbone (the visual backbone options are the same as above), replace the text-encoder flags as follows:
```bash
python train.py --... \
                --txt_enc_type bert \
                --bert_type bert-base-uncased \
                --...
```
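Putting the pieces together, a full command for fine-tuning BERT with the ViT backbone on Flickr30K might look like the following; the log name is illustrative, and every flag is taken from the examples above:

```bash
python train.py --log_name bert_vit_f30k \
                --data_name f30k \
                --precomp_enc_type backbone \
                --batch_size 128 \
                --txt_enc_type bert \
                --bert_type bert-base-uncased \
                --img_enc_type vit \
                --vit_type google/vit-base-patch16-224 \
                --backbone_lr_factor 0.05 \
                --loss_type acc
```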
Run `eval.py` to evaluate pre-trained models on either COCO or Flickr30K. The supported language encoders are `TName={bigru/BERT}`, and the supported vision encoders are `IName={project/butd/vit}`.
For evaluating pre-trained models on Flickr30K, use the command:
```bash
python eval.py --data_name f30k \
               --txt_enc_type TName \
               --img_enc_type IName
```
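For instance, to evaluate a BiGRU + BUTD model on Flickr30K, substitute the placeholders with values from the lists above:

```bash
python eval.py --data_name f30k \
               --txt_enc_type bigru \
               --img_enc_type butd
```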
For evaluating pre-trained models on MS-COCO, use the command:
```bash
python eval.py --data_name coco \
               --txt_enc_type TName \
               --img_enc_type IName
```
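Similarly, to evaluate a BERT + ViT model on MS-COCO (assuming the encoder name is lower-cased here, matching the training flags above):

```bash
python eval.py --data_name coco \
               --txt_enc_type bert \
               --img_enc_type vit
```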