Baseline model for the nocaps benchmark: a re-implementation based on the UpDown image captioning model, trained on the COCO dataset (only). Check out our package documentation at nocaps.org/updown-baseline!
If you find this code useful, please consider citing:
@inproceedings{nocaps2019,
author = {Harsh Agrawal* and Karan Desai* and Yufei Wang and Xinlei Chen and Rishabh Jain and
Mark Johnson and Dhruv Batra and Devi Parikh and Stefan Lee and Peter Anderson},
title = {{nocaps}: {n}ovel {o}bject {c}aptioning {a}t {s}cale},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2019}
}
As well as the paper that proposed this model:
@inproceedings{Anderson2017up-down,
author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson
and Stephen Gould and Lei Zhang},
title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
booktitle = {CVPR},
year = {2018}
}
This codebase requires Python 3.6 or higher. It uses PyTorch v1.1 and has out-of-the-box support for CUDA 9 and CuDNN 7. The recommended way to set up this codebase is through Anaconda or Miniconda; however, it should work just as well with virtualenv.
- Install the Anaconda or Miniconda distribution based on Python 3+ from their downloads site.
- Clone the repository:
git clone https://www.github.com/nocaps-org/updown-baseline
cd updown-baseline
- Create a conda environment, install all the dependencies, and install this codebase as a package in development mode:
conda create -n updown python=3.6
conda activate updown
pip install -r requirements.txt
python setup.py develop
Note: If the evalai package install fails, install these packages and try again:
sudo apt-get install libxml2-dev libxslt1-dev
Now you can import updown from anywhere in your filesystem, as long as you have this conda environment activated.
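As a quick sanity check (an illustrative snippet, not part of the codebase), you can import the package and print where it resolves from; the path should point inside your cloned updown-baseline directory:

# Run with the updown environment active.
import updown
print(updown.__file__)  # should point inside updown-baseline/updown/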
We provide pre-extracted bottom-up features for the COCO and nocaps splits. These are extracted using a Faster R-CNN detector pretrained on Visual Genome (Anderson et al. 2017). We extract features from up to 100 region proposals per image and keep those above a confidence threshold of 0.2, which yields 10-100 features per image (adaptive).
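Since the number of regions varies per image, batching these features typically requires padding them to a fixed maximum along with a mask. The snippet below is only a minimal illustrative sketch of that pattern -- the function name and array shapes are assumptions, not this codebase's actual data-loading API:

import numpy as np

def pad_image_features(features, max_boxes=100):
    """Pad a (k, d) feature array (10 <= k <= max_boxes) to (max_boxes, d).

    Returns the padded array and a boolean mask marking real (non-padded) rows.
    Shapes and names are illustrative; the loaders in this repository may differ.
    """
    k, d = features.shape
    padded = np.zeros((max_boxes, d), dtype=features.dtype)
    padded[:k] = features
    mask = np.zeros(max_boxes, dtype=bool)
    mask[:k] = True
    return padded, mask

# Example: an image with 37 detected regions and 2048-d features.
feats, mask = pad_image_features(np.random.rand(37, 2048).astype(np.float32))
print(feats.shape, int(mask.sum()))  # (100, 2048) 37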
Download (or symlink) the image features under the $PROJECT_ROOT/data directory: coco_train2017, coco_val2017, nocaps_val, nocaps_test.
Download COCO captions and nocaps val/test image info, and arrange them in the directory structure below (one way to automate this is sketched after the download links):
$PROJECT_ROOT/data
    |-- coco
    |   +-- annotations
    |       |-- captions_train2017.json
    |       +-- captions_val2017.json
    +-- nocaps
        +-- annotations
            |-- nocaps_val_image_info.json
            +-- nocaps_test_image_info.json
- COCO captions: http://images.cocodataset.org/annotations/annotations_trainval2017.zip
- nocaps val image info: https://s3.amazonaws.com/nocaps/nocaps_val_image_info.json
- nocaps test image info: https://s3.amazonaws.com/nocaps/nocaps_test_image_info.json
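For convenience, one way to fetch and arrange these files is sketched below (using wget or curl works just as well); it relies only on the URLs listed above and the directory layout shown earlier:

import os
import urllib.request
import zipfile

data_dir = "data"
os.makedirs(os.path.join(data_dir, "coco"), exist_ok=True)
os.makedirs(os.path.join(data_dir, "nocaps", "annotations"), exist_ok=True)

# COCO captions: the zip expands into an annotations/ folder containing
# captions_train2017.json and captions_val2017.json (among other annotation files).
zip_path = os.path.join(data_dir, "coco", "annotations_trainval2017.zip")
urllib.request.urlretrieve(
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip", zip_path
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(os.path.join(data_dir, "coco"))

# nocaps val/test image info files.
for split in ("val", "test"):
    filename = "nocaps_{}_image_info.json".format(split)
    urllib.request.urlretrieve(
        "https://s3.amazonaws.com/nocaps/" + filename,
        os.path.join(data_dir, "nocaps", "annotations", filename),
    )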
Build the caption vocabulary using COCO train2017 captions:
python scripts/build_vocabulary.py -c data/coco/captions_train2017.json -o data/vocabulary
nocaps val and test splits are held privately behind EvalAI. To evaluate on nocaps, create an account on EvalAI and get the auth token from your profile details. Set the token through the EvalAI CLI as follows:
evalai set_token <your_token_here>
You are all set to use this codebase!
We manage experiments through config files -- a config file should contain arguments which are specific to a particular experiment, such as those defining model architecture or optimization hyperparameters. Other arguments, such as GPU IDs or the number of CPU workers, should be declared in the script and passed in as argparse-style arguments. Train a baseline UpDown Captioner with all the default hyperparameters as follows; this reproduces the results of the first row in the nocaps val/test tables from our paper.
python scripts/train.py \
--config-yml configs/updown_nocaps_val.yaml \
--gpu-ids 0 --serialization-dir checkpoints/updown-baseline
Refer to updown/config.py for default hyperparameters. For other configurations, pass a path to a config file through the --config-yml argument, and/or a set of key-value pairs through the --config-override argument. For example:
python scripts/train.py \
--config-yml configs/updown_nocaps_val.yaml \
--config-override OPTIM.BATCH_SIZE 250 \
--gpu-ids 0 --serialization-dir checkpoints/updown-baseline
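For reference, the override mechanism amounts to loading the YAML config and overwriting nested keys given as dotted KEY VALUE pairs. The snippet below is only a sketch of that pattern (assuming PyYAML is available); the actual configuration logic lives in updown/config.py and may differ:

import yaml

def load_config(config_path, overrides=()):
    """Load a YAML config and apply (dotted_key, value) override pairs,
    e.g. ("OPTIM.BATCH_SIZE", 250). Illustrative sketch only; see
    updown/config.py for the real implementation."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    pairs = iter(overrides)
    for key, value in zip(pairs, pairs):
        node = config
        *parents, leaf = key.split(".")
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return config

# Example mirroring the command above:
config = load_config("configs/updown_nocaps_val.yaml", ("OPTIM.BATCH_SIZE", 250))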
Multi-GPU training is fully supported; pass GPU IDs as --gpu-ids 0 1 2 3.
This script serializes model checkpoints every few iterations and keeps track of the best performing checkpoint based on overall CIDEr score. Refer to updown/utils/checkpointing.py for more details on how checkpointing is managed. A copy of the configuration file used for a particular experiment is also saved under --serialization-dir.
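The checkpointing pattern described here (periodic saves plus a separately tracked best-by-CIDEr copy) can be summarized by the minimal sketch below; the class and file names are illustrative, and the actual logic lives in updown/utils/checkpointing.py:

import os
import torch

class SimpleCheckpointManager:
    """Minimal sketch: periodically save model/optimizer state and keep a
    separate copy of the best checkpoint according to a scalar metric
    (overall CIDEr here). Illustrative only; see updown/utils/checkpointing.py."""

    def __init__(self, model, optimizer, serialization_dir):
        self.model = model
        self.optimizer = optimizer
        self.serialization_dir = serialization_dir
        self.best_metric = float("-inf")

    def step(self, iteration, metric):
        state = {
            "iteration": iteration,
            "model": self.model.state_dict(),
            "optimizer": self.optimizer.state_dict(),
        }
        torch.save(state, os.path.join(self.serialization_dir, "checkpoint_{}.pth".format(iteration)))
        if metric > self.best_metric:
            self.best_metric = metric
            torch.save(state, os.path.join(self.serialization_dir, "checkpoint_best.pth"))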
This script logs loss curves and metrics to Tensorboard; log files are written under --serialization-dir. Execute tensorboard --logdir /path/to/serialization_dir --port 8008 and visit localhost:8008 in the browser.
Generate predictions for nocaps val or nocaps test using a pretrained checkpoint:
python scripts/inference.py \
--config-yml /path/to/config.yaml \
--checkpoint-path /path/to/checkpoint.pth \
--output-path /path/to/save/predictions.json \
--gpu-ids 0
Add the --evalai-submit flag if you wish to submit the predictions directly to EvalAI and get results.
Pre-trained checkpoint with the provided config is available to download here:
- Checkpoint (.pth file): https://bit.ly/2JwuHcP
- Predictions on nocaps val: https://bit.ly/2YKxxBA
- Predictions on nocaps test: https://bit.ly/2XBs0R4
| split | in-domain CIDEr | in-domain SPICE | near-domain CIDEr | near-domain SPICE | out-of-domain CIDEr | out-of-domain SPICE | overall BLEU1 | overall BLEU4 | overall METEOR | overall ROUGE | overall CIDEr | overall SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| val | 78.1 | 11.6 | 57.7 | 10.3 | 31.3 | 8.3 | 73.7 | 18.3 | 22.7 | 50.4 | 55.3 | 10.1 |
| test | 74.3 | 11.5 | 56.9 | 10.3 | 30.1 | 8.1 | 74.0 | 19.2 | 23.0 | 51.0 | 54.3 | 10.1 |