
Multimodal Alignment Framework

Implementation of MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding.

Some of our code is based on ban-vqa. Thanks!

TODO: provide a Faster R-CNN feature extraction script.

Prerequisites

  • python 3.7
  • pytorch 1.4.0

Data

Flickr30k Entities

We use the Flickr30k Entities dataset to train and validate our model.

The raw dataset can be found at Flickr30k Entities Annotations.

Run sh tools/prepare_data.sh to download and process the Flickr30k annotations, images, and GloVe word embeddings.
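For reference, GloVe files store one token per line followed by its float components. A minimal parsing sketch (the helper name parse_glove_lines is ours, not part of this repo):

```python
# Minimal sketch of parsing GloVe-style word vectors.
# Each line: "<token> <v1> <v2> ... <vd>".
def parse_glove_lines(lines):
    """Return a dict mapping token -> list of floats."""
    embeddings = {}
    for line in lines:
        token, *values = line.rstrip().split(" ")
        embeddings[token] = [float(v) for v in values]
    return embeddings

# Toy example with 3-dimensional vectors:
sample = ["dog 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
vectors = parse_glove_lines(sample)
print(len(vectors["dog"]))  # 3
```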

Object proposals

Download object proposals:

We use an off-the-shelf Faster R-CNN pretrained on Visual Genome to generate object proposals and labels, and Bottom-Up Attention for visual features.
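At grounding time, each phrase is matched against the proposals' visual features. The sketch below is illustrative only: MAF's actual model uses learned projections, attention, and detected labels, while here we simply score one phrase embedding against N proposal features by dot product.

```python
import numpy as np

def best_proposal(phrase_vec, proposal_feats):
    """Return the index of the proposal with the highest similarity score."""
    scores = proposal_feats @ phrase_vec  # shape (N,)
    return int(np.argmax(scores))

# Toy 2-D features: proposal 2 points in roughly the phrase's direction.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
phrase = np.array([0.6, 0.8])
print(best_proposal(phrase, feats))  # 2
```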

As Issue #1 pointed out, there is some inconsistency between features generated by our Faster R-CNN script and those from Bottom-Up Attention. We therefore upload our generated features.

Download train_features_compress.hdf5 (6GB), val_features_compress.hdf5, and test_features_compress.hdf5 to data/flickr30k.

Alternative link for train_feature.hdf5 (20GB, same features): Google Drive; Baidu drive, code: n1yd.

Download train_detection_dict.json, val_detection_dict.json, and test_detection_dict.json to data/.
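The detection dicts are plain JSON. The exact schema is an assumption here (a map from image id to detected boxes and labels; the image id and entries below are made up), but loading follows the usual pattern:

```python
import json

# Hypothetical per-image detection record; the real schema of
# *_detection_dict.json may differ.
detections = {
    "36979": [
        {"bbox": [12.0, 34.0, 200.0, 180.0], "label": "dog"},
        {"bbox": [0.0, 50.0, 90.0, 120.0], "label": "frisbee"},
    ]
}

# Round-trip through JSON, as when loading the downloaded file
# with json.load(open("data/train_detection_dict.json")):
loaded = json.loads(json.dumps(detections))
labels = [d["label"] for d in loaded["36979"]]
print(labels)  # ['dog', 'frisbee']
```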

Generate object proposals by yourself (TODO)

Run sh tools/prepare_detection.sh to clone the Faster R-CNN code and download pre-trained models.

Run sh tools/run_faster_rcnn.sh to run Faster R-CNN detection on the Flickr30k dataset and generate features.

You may have to customize your environment to run Faster R-CNN successfully; see Prerequisites.

Training

python main.py [args]

In our experiments, we obtain ~61% accuracy with the default settings.
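The accuracy here is the standard phrase-grounding metric: a predicted box counts as correct when its IoU with the gold box is at least 0.5. A self-contained sketch of that computation (helper names are ours):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def grounding_accuracy(pred_boxes, gold_boxes, thresh=0.5):
    """Fraction of phrases whose predicted box overlaps gold at IoU >= thresh."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(pred_boxes, gold_boxes))
    return hits / len(gold_boxes)

# First prediction overlaps its gold box (IoU ~0.68), second misses entirely.
preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
golds = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(grounding_accuracy(preds, golds))  # 0.5
```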

Evaluating

Our trained model can be downloaded from Google Drive.

python test.py --file <saved model>