Multimodal Predicate-Noun Dependencies

Code for the EMNLP 2022 paper "Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?" by Mitja Nikolaus, Emmanuelle Salin, Stephane Ayache, Abdellah Fourtassi, and Benoit Favre.

Installation

  • git clone --recursive <repo>

Data

All images whose image ids occur in data/sentence-semantics/eval_set.json have to be downloaded to ~/data/multimodal_evaluation/images. Feel free to contact me if you would like a zip of the relevant images and pre-extracted image features!
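
As a starting point, here is a minimal sketch for collecting the required image ids and checking which files are already present locally. It assumes eval_set.json is a list of examples with an "img_id" field and that images are stored as <img_id>.jpg; both are assumptions that may need adjusting to the actual schema and file layout.

import json
from pathlib import Path

EVAL_SET = Path("data/sentence-semantics/eval_set.json")
IMAGES_DIR = Path("~/data/multimodal_evaluation/images").expanduser()

# Collect all image ids referenced by the eval set.
# ("img_id" is an assumed field name; adjust it to the actual JSON schema.)
with open(EVAL_SET) as f:
    eval_set = json.load(f)
img_ids = {str(example["img_id"]) for example in eval_set}

# Report which images are still missing (assuming files are named <img_id>.jpg).
missing = sorted(i for i in img_ids if not (IMAGES_DIR / f"{i}.jpg").exists())
print(f"{len(img_ids)} images referenced, {len(missing)} missing in {IMAGES_DIR}")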

Python Environments:

vl-eval-pytorch-1.8.1

For LXMERT, UNITER, ViLT

conda env create --file environment_vl-eval-pytorch-1.8.1.yml 
cd src/ViLT && pip install . && cd -

vl-eval-vilbert

conda env create --file environment_vl-eval-vilbert.yml
conda activate vl-eval-vilbert
cd src/vilbert-multi-task && python setup.py build develop && cd -

vl-eval-vinvl

conda env create --file environment_vl-eval-vinvl.yml
conda activate vl-eval-vinvl
cd src/Oscar && python setup.py build develop && cd -
python -m pip install git+https://github.com/facebookresearch/maskrcnn-benchmark.git

clip

conda env create --file environment_vl-eval-clip.yml
conda activate vl-eval-clip
pip install git+https://github.com/openai/CLIP.git

volta

For models trained in controlled conditions (VOLTA framework)

conda env create --file environment_vl-eval-volta.yml
conda activate vl-eval-volta
cd src/volta && python setup.py develop && cd -

Models

LXMERT

  • Environment: vl-eval-pytorch-1.8.1
  • Checkpoint: unc-nlp/lxmert-base-uncased
  • Image features: Bottom-up (36 boxes)
conda activate vl-eval-pytorch-1.8.1
python eval_sentence_semantics.py --model LXMERT --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_features_2048.tsv
python eval_sentence_semantics.py --model LXMERT --cropped --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_cropped_features_2048.tsv
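
The .tsv files passed via --img-features-path follow the usual bottom-up-attention feature format (one row per image with base64-encoded boxes and 2048-d region features). A rough sketch of reading such a file is below; the exact column names are an assumption based on that format and may differ from the files used here.

import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)

# Assumed bottom-up-attention TSV schema; adjust if the actual files differ.
FIELDS = ["img_id", "img_h", "img_w", "num_boxes", "boxes", "features"]

def read_tsv_features(path):
    features = {}
    with open(path) as f:
        for row in csv.DictReader(f, delimiter="\t", fieldnames=FIELDS):
            num_boxes = int(row["num_boxes"])
            feats = np.frombuffer(base64.b64decode(row["features"]), dtype=np.float32)
            features[row["img_id"]] = feats.reshape(num_boxes, 2048)
    return features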

UNITER

conda activate vl-eval-pytorch-1.8.1
python eval_sentence_semantics.py --model UNITER --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_features_2048.tsv
python eval_sentence_semantics.py --model UNITER --cropped --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_cropped_features_2048.tsv

ViLT

conda activate vl-eval-pytorch-1.8.1
python eval_sentence_semantics.py --model VILT --images-dir ~/data/multimodal_evaluation/images
python eval_sentence_semantics.py --model VILT --cropped --images-dir ~/data/multimodal_evaluation/images_cropped

Oscar

conda activate vl-eval-vinvl
python eval_sentence_semantics.py --model Oscar --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_features_2048_10_100.tsv
python eval_sentence_semantics.py --model Oscar --cropped --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_cropped_features_2048_10_100.tsv

VinVL

conda activate vl-eval-vinvl
python eval_sentence_semantics.py --model VINVL --img-features-path ~/data/multimodal_evaluation/image_features_vinvl/
python eval_sentence_semantics.py --model VINVL --cropped --img-features-path ~/data/multimodal_evaluation/image_features_vinvl_cropped/

ViLBERT

conda activate vl-eval-vilbert
python eval_sentence_semantics.py --model VILBERT --img-features-path ~/data/multimodal_evaluation/image_features_vilbert/img_features.p
python eval_sentence_semantics.py --model VILBERT --cropped --img-features-path ~/data/multimodal_evaluation/image_features_vilbert/img_cropped_features.p
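
The ViLBERT features come as a pickle file rather than a .tsv. A quick way to inspect what it contains (the assumed structure, a dict keyed by image id, may not match the actual layout):

import pickle

# The structure of the pickle is an assumption; print it to see the actual layout.
with open("img_features.p", "rb") as f:
    feats = pickle.load(f)
print(type(feats), len(feats))
if isinstance(feats, dict):
    first_key = next(iter(feats))
    print(first_key, type(feats[first_key]))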

CLIP

  • Environment: vl-eval-clip
  • Checkpoint: ViT-B/32
  • Image features: extracted within model
conda activate vl-eval-clip
python eval_sentence_semantics.py --model CLIP --images-dir ~/data/multimodal_evaluation/images
python eval_sentence_semantics.py --model CLIP --cropped --images-dir ~/data/multimodal_evaluation/images_cropped
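
For reference, a minimal sketch of how the ViT-B/32 checkpoint can score an image against two candidate sentences, which is the kind of image-sentence matching the evaluation relies on. The image file and sentences are placeholders, not items from the evaluation set.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and sentence pair (not from the actual eval set).
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["A person is sitting.", "A person is standing."]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)
print(probs)  # higher probability = sentence judged more compatible with the image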

VOLTA

  • Environment: vl-eval-volta
  • Image features: Bottom-up (36 boxes)

Example eval for VisualBERT:

conda activate vl-eval-volta
python eval_sentence_semantics_volta.py --from_pretrained ~/data/volta/VisualBERT --config_file src/volta/config/ctrl_visualbert_base.json --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_features_2048.tsv
python eval_sentence_semantics_volta.py --from_pretrained ~/data/volta/VisualBERT --config_file src/volta/config/ctrl_visualbert_base.json --cropped --img-features-path ~/data/multimodal_evaluation/image_features_2048/img_cropped_features_2048.tsv

Analyses

Analysis scripts can be run after model results have been saved to runs/sentence-semantics.

Detailed results

Generate per-concept results plots (and more):

python plot_sentence_semantics_results.py --input-file runs/sentence-semantics/LXMERT/results.csv
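
If you want to inspect the per-model results directly, a small pandas sketch along these lines works; the column names "concept" and "correct" are assumptions about results.csv and may need to be adapted to the actual header.

import pandas as pd

# Column names are assumptions about results.csv; adapt them to the actual header.
results = pd.read_csv("runs/sentence-semantics/LXMERT/results.csv")
per_concept = results.groupby("concept")["correct"].mean().sort_values()
print(per_concept)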

Correlations

(First, download Train_GCC-training.tsv and save it to data/conceptual_captions/.)

Run correlations between common predictors and model performance:

python sentence_semantics_correlations.py --models LXMERT UNITER VILBERT VILT VINVL CLIP
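
As an illustration of the kind of predictor used there, the sketch below counts word frequencies in the Conceptual Captions captions and correlates them with per-concept accuracy for a single model. Train_GCC-training.tsv is tab-separated with the caption in the first column; the results.csv columns are assumptions as above, and multi-word concepts would need more careful matching.

import csv
from collections import Counter

import pandas as pd
from scipy.stats import spearmanr

# Captions are in the first column; disable quoting since captions may contain quotes.
captions = pd.read_csv(
    "data/conceptual_captions/Train_GCC-training.tsv",
    sep="\t", header=None, names=["caption", "url"], quoting=csv.QUOTE_NONE,
)["caption"]

# Count how often each word occurs across all captions.
word_counts = Counter(w for c in captions for w in str(c).lower().split())

# Per-concept accuracy; column names are assumptions about results.csv.
results = pd.read_csv("runs/sentence-semantics/LXMERT/results.csv")
accuracy = results.groupby("concept")["correct"].mean()

# Naive lookup: treats each concept as a single word.
freqs = [word_counts.get(concept, 0) for concept in accuracy.index]
print(spearmanr(freqs, accuracy.values))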