
VisualBERT: A Simple and Performant Baseline for Vision and Language

This repository contains code for the paper VisualBERT: A Simple and Performant Baseline for Vision and Language (arXiv).

The repository is still under development. Please open an issue if you have any questions or comments.

pytorch_pretrained_bert contains a modified version of an early clone of HuggingFace's PyTorch BERT. The core of VisualBERT is implemented mainly by modifying modeling.py. Two wrapper models are implemented in models/model.py, and code for loading the different datasets is in dataloaders, where AllenNLP's Field and Instance abstractions are used extensively to wrap data.
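For illustration, below is a minimal sketch (not actual repository code; the field names and shapes are hypothetical) of how a dataloader might wrap a single example using AllenNLP's Field and Instance:

import numpy as np
from allennlp.data import Instance
from allennlp.data.fields import ArrayField, LabelField, MetadataField

def wrap_example(image_features, label, image_id):
    # image_features: a (num_boxes, feature_dim) array from the detector
    fields = {
        "image_features": ArrayField(np.asarray(image_features)),
        "label": LabelField(label, skip_indexing=True),  # label is an int
        "metadata": MetadataField({"image_id": image_id}),
    }
    return Instance(fields)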

I borrowed and modified code from several repositories, including but not limited to: R2C, Pythia, HuggingFace BERT, BAN, Bottom-up and Top-down Attention, and AllenNLP. I would like to express my gratitude to the authors of these repositories.

Dependencies

Basic

Dependencies of this repository are similar to those of R2C. If you don't need to extract image features yourself or run on VCR, the basic dependencies below are all you need (assuming you are using a fresh conda environment):

conda install numpy pyyaml setuptools cmake cffi tqdm scipy ipython mkl mkl-include cython typing h5py pandas nltk spacy numpydoc scikit-learn jpeg


# Please check your CUDA version using `nvcc --version` and make sure it matches the cudatoolkit version below.
conda install pytorch torchvision cudatoolkit=XXX -c pytorch


# Below is the way to install allennlp recommended in R2C, but in my experience, directly installing allennlp also works.
pip install -r allennlp-requirements.txt
pip install --no-deps allennlp==0.8.0
python -m spacy download en_core_web_sm
pip install attrdict
pip install pycocotools
pip install commentjson
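After installing, a quick sanity check (my own suggestion, not part of the original setup) that the core dependencies resolved:

import torch, torchvision, allennlp, spacy

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision", torchvision.__version__)
spacy.load("en_core_web_sm")  # verifies the spaCy model downloaded above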

Extracting image features

Install these only if you want to run on VCR or extract image features on your own:

  1. A special version of torchvision that includes the ROIAlign layer:
pip install git+git://github.com/pytorch/vision.git@24577864e92b72f7066e1ed16e978e873e19d13d
  2. Detectron

Troubleshooting

Below are some problems I have had when installing these dependencies:

  1. pyyaml version.

Detectron might break when the pyyaml version is too high (not sure if this has been fixed by now). Solution: install a lower version with pip install pyyaml==3.12. (facebookresearch/Detectron#840, pypa/pip#5247)

  2. Error when importing torchvision or the ROIAlign layer.

Most likely a version mismatch between cudatoolkit and your CUDA installation. Specify the correct cudatoolkit version in conda install pytorch torchvision cudatoolkit=XXX -c pytorch (see the check after this list).

  3. Segmentation fault when running the ResNet50 detector on VCR.

Most likely your GCC version is too low; GCC >= 5.0 is required (see the check after this list).

conda install -c psi4 gcc-5 seems to solve the problem. (https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/TROUBLESHOOTING.md)
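To diagnose items 2 and 3 above, a quick check (my own suggestion) that prints the CUDA version PyTorch was built with and the active GCC version:

import subprocess
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)  # should match `nvcc --version`
print("CUDA available:", torch.cuda.is_available())
print(subprocess.check_output(["gcc", "--version"]).decode().splitlines()[0])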

Running the code

Assume that the folder XX is the parent directory of the code directory.

export PYTHONPATH=$PYTHONPATH:XX/visualbert/
export PYTHONPATH=$PYTHONPATH:XX/

cd XX/visualbert/models/

CUDA_VISIBLE_DEVICES=XXXX python train.py -folder [where you want to save the logs/models] -config XX/visualbert/configs/YYYY

In visualbert/configs are configs for different models on different datasets. Please change the data path (data_root) and model path (restore_bin, the path to the model you want to initialize from) in the config to match your local setup.
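If you prefer to adjust these paths programmatically, here is a hedged sketch using commentjson (installed above); the config schema beyond data_root and restore_bin is an assumption:

import commentjson

# Load an existing config, point it at local data and a checkpoint,
# and save a copy to pass to train.py via -config.
with open("visualbert/configs/nlvr2/fine-tune.json") as f:
    config = commentjson.load(f)

config["data_root"] = "/path/to/X_NLVR"           # your local data folder
config["restore_bin"] = "/path/to/checkpoint.th"  # model to initialize from

with open("my-fine-tune.json", "w") as f:
    commentjson.dump(config, f, indent=2)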

NLVR2

Prepare Data

Download our pre-computed features to a folder X_NLVR (Train, Val, Test-Public). The image features are from a Detectron model (e2e_mask_rcnn_R-101-FPN_2x, model_id: 35861858). To download from Google Drive links on the command line, check out https://github.com/gdrive-org/gdrive.

Then download the three JSON files from the NLVR GitHub page into X_NLVR.

For COCO pre-training, first download COCO caption annotations to a folder X_COCO.

cd X_COCO
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip annotations_trainval2014.zip

Then download COCO image features to X_COCO. Train, Val.
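Once the annotations are unpacked, an optional sanity check (my own suggestion) using pycocotools, which was installed earlier:

from pycocotools.coco import COCO

# Confirm the caption annotations unpacked correctly.
coco = COCO("annotations/captions_train2014.json")
print(len(coco.getAnnIds()), "training captions loaded")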

COCO Pre-training

The corresponding config is visualbert/configs/nlvr2/coco-pre-train.json. Model checkpoint.

Task-specific Pre-training

The corresponding config is visualbert/configs/nlvr2/pre-train.json. Model checkpoint.

Fine-tuning

The corresponding config is visualbert/configs/nlvr2/fine-tune.json. Model checkpoint.

VQA

Prepare Data

The image features and VQA data are from Pythia. Assume the data is stored in X_COCO:

cd X_COCO
cd data
wget https://dl.fbaipublicfiles.com/pythia/data/vocabulary_vqa.txt
wget https://dl.fbaipublicfiles.com/pythia/data/answers_vqa.txt
wget https://dl.fbaipublicfiles.com/pythia/data/imdb.tar.gz
gunzip imdb.tar.gz 
tar -xf imdb.tar

wget https://dl.fbaipublicfiles.com/pythia/features/detectron_fix_100.tar.gz
gunzip detectron_fix_100.tar.gz
tar -xf detectron_fix_100.tar
rm -f detectron_fix_100.tar
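To verify the download, a hedged peek at one of the unpacked imdb files; the exact file path and the .npy-of-records layout are assumptions based on Pythia's data format:

import numpy as np

# The Pythia imdbs are numpy arrays of dict-like records
# (the first entry is typically a metadata header).
imdb = np.load("imdb/imdb_train2014.npy", allow_pickle=True)
print(len(imdb), "entries;", imdb[0])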

COCO Pre-training

The corresponding config is visualbert/configs/vqa/coco-pre-train.json. Model checkpoint.

Task-specific Pre-training

The corresponding config is visualbert/configs/vqa/pre-train.json. Model checkpoint.

Fine-tuning

The corresponding config is visualbert/configs/vqa/fine-tune.json. Model checkpoint.

VCR

Prepare Data

Download the VCR images and annotations:

cd X_VCR
wget https://s3.us-west-2.amazonaws.com/ai2-rowanz/vcr1annots.zip
wget https://s3.us-west-2.amazonaws.com/ai2-rowanz/vcr1images.zip
unzip vcr1annots.zip
unzip vcr1images.zip

For COCO pre-training, first download the raw COCO images:

cd X_COCO
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip train2014.zip
unzip val2014.zip

Then download to X_COCO the detection results (boxes and masks) on COCO (Train, Val), produced by the large detector used when creating the VCR dataset.

COCO Pre-training

The corresponding config is visualbert/configs/vcr/coco-pre-train.json. Model checkpoint.

Task-specific Pre-training

The corresponding config is visualbert/configs/vcr/pre-train.json. Model checkpoint.

Fine-tuning

The corresponding config is visualbert/configs/vcr/fine-tune.json. Model checkpoint.

Flickr30K

Coming soon!

Extract image features on your own

Extract features using Detectron for NLVR2

Download the corresponding config (XXX.yaml) and checkpoint (XXX.pkl) from Detectron. The model I used is 35861858.

Download the NLVR2 images to a folder X_NLVR_IMAGE (you need to request them from the authors of NLVR2: https://github.com/lil-lab/nlvr/tree/master/nlvr2).

Then run:

#SET = train/dev/test1
cd visualbert/utils/get_image_features
CUDA_VISIBLE_DEVICES=0 python extract_features_nlvr.py --cfg XXX.yaml --wts XXX.pkl --min_bboxes 150 --max_bboxes 150 --feat_name gpu_0/fc6 --output_dir X_NLVR --image-ext png X_NLVR_IMAGE/SET --no_id --one_giant_file X_NLVR/features_SET_150.th
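The --one_giant_file output should then be loadable with torch.load; the internal structure of the saved object is an assumption here, so this just checks that the file loads:

import torch

# Hedged sanity check on the extracted features for one split.
feats = torch.load("X_NLVR/features_train_150.th")
print(type(feats))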

Evaluation

Coming soon!

Visualization

Coming soon!