Baseline model for the nocaps benchmark: a re-implementation based on the UpDown image captioning model, trained on the COCO dataset (only). Check out our package documentation at nocaps.org/updown-baseline!
If you find this code useful, please consider citing:
@inproceedings{nocaps2019,
author = {Harsh Agrawal* and Karan Desai* and Yufei Wang and Xinlei Chen and Rishabh Jain and
Mark Johnson and Dhruv Batra and Devi Parikh and Stefan Lee and Peter Anderson},
title = {{nocaps}: {n}ovel {o}bject {c}aptioning {a}t {s}cale},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2019}
}
As well as the paper that proposed this model:
@inproceedings{Anderson2017up-down,
author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson
and Stephen Gould and Lei Zhang},
title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
booktitle = {CVPR},
year = {2018}
}
This codebase requires Python 3.6 or higher. It uses PyTorch v1.1 and has out-of-the-box support for CUDA 9 and CuDNN 7. The recommended way to set up this codebase is through Anaconda or Miniconda; however, it should work just as well with virtualenv.
- Install the Anaconda or Miniconda distribution based on Python 3+ from their downloads site.
- Clone the repository:
git clone https://www.github.com/nocaps-org/updown-baseline
cd updown-baseline
- Create a conda environment, install all the dependencies, and install this codebase as a package in development mode:
conda create -n updown python=3.6
conda activate updown
pip install -r requirements.txt
python setup.py develop
Note: If the evalai package install fails, install these packages and try again:
sudo apt-get install libxml2-dev libxslt1-dev
Now you can import updown from anywhere in your filesystem, as long as you have this conda environment activated.
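As a quick sanity check (an illustrative snippet, not part of the codebase), you can import the package and print where it resolves from; the path should point inside your cloned updown-baseline directory:

# Run with the updown environment active.
import updown
print(updown.__file__)  # should point inside updown-baseline/updown/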
We provide pre-extracted bottom-up features for the COCO and nocaps splits. These are extracted using a Faster R-CNN detector pretrained on Visual Genome (Anderson et al. 2017). We extract features from up to 100 region proposals per image and keep those above a confidence threshold of 0.2, which yields 10-100 features per image (adaptive).
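Since the number of regions varies per image, batching these features typically requires padding them to a fixed maximum along with a mask. The snippet below is only a minimal illustrative sketch of that pattern -- the function name and array shapes are assumptions, not this codebase's actual data-loading API:

import numpy as np

def pad_image_features(features, max_boxes=100):
    """Pad a (k, d) feature array (10 <= k <= max_boxes) to (max_boxes, d).

    Returns the padded array and a boolean mask marking real (non-padded) rows.
    Shapes and names are illustrative; the loaders in this repository may differ.
    """
    k, d = features.shape
    padded = np.zeros((max_boxes, d), dtype=features.dtype)
    padded[:k] = features
    mask = np.zeros(max_boxes, dtype=bool)
    mask[:k] = True
    return padded, mask

# Example: an image with 37 detected regions and 2048-d features.
feats, mask = pad_image_features(np.random.rand(37, 2048).astype(np.float32))
print(feats.shape, int(mask.sum()))  # (100, 2048) 37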
Download (or symlink) the image features under the $PROJECT_ROOT/data directory: coco_train2017, coco_val2017, nocaps_val, nocaps_test.
Download COCO captions and nocaps val/test image info, and arrange them in the directory structure below (one way to automate this is sketched after the download links):
$PROJECT_ROOT/data
    |-- coco
    |   +-- annotations
    |       |-- captions_train2017.json
    |       +-- captions_val2017.json
    +-- nocaps
        +-- annotations
            |-- nocaps_val_image_info.json
            +-- nocaps_test_image_info.json
- COCO captions: http://images.cocodataset.org/annotations/annotations_trainval2017.zip
- nocaps val image info: https://s3.amazonaws.com/nocaps/nocaps_val_image_info.json
- nocaps test image info: https://s3.amazonaws.com/nocaps/nocaps_test_image_info.json
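For convenience, one way to fetch and arrange these files is sketched below (using wget or curl works just as well); it relies only on the URLs listed above and the directory layout shown earlier:

import os
import urllib.request
import zipfile

data_dir = "data"
os.makedirs(os.path.join(data_dir, "coco"), exist_ok=True)
os.makedirs(os.path.join(data_dir, "nocaps", "annotations"), exist_ok=True)

# COCO captions: the zip expands into an annotations/ folder containing
# captions_train2017.json and captions_val2017.json (among other annotation files).
zip_path = os.path.join(data_dir, "coco", "annotations_trainval2017.zip")
urllib.request.urlretrieve(
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip", zip_path
)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(os.path.join(data_dir, "coco"))

# nocaps val/test image info files.
for split in ("val", "test"):
    filename = "nocaps_{}_image_info.json".format(split)
    urllib.request.urlretrieve(
        "https://s3.amazonaws.com/nocaps/" + filename,
        os.path.join(data_dir, "nocaps", "annotations", filename),
    )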
Build the caption vocabulary using COCO train2017 captions:
python scripts/build_vocabulary.py -c data/coco/captions_train2017.json -o data/vocabulary
nocaps val and test splits are held privately behind EvalAI. To evaluate on nocaps, create an account on EvalAI and get the auth token from your profile details. Set the token through the EvalAI CLI as follows:
evalai set_token <your_token_here>
You are all set to use this codebase!
We manage experiments through config files -- a config file should contain arguments which are specific to a particular experiment, such as those defining model architecture or optimization hyperparameters. Other arguments, such as GPU IDs or the number of CPU workers, should be declared in the script and passed in as argparse-style arguments. Train a baseline UpDown Captioner with all the default hyperparameters as follows; this reproduces the results of the first row in the nocaps val/test tables from our paper.
python scripts/train.py \
--config-yml configs/updown_nocaps_val.yaml \
--gpu-ids 0 --serialization-dir checkpoints/updown-baseline
Refer to updown/config.py for default hyperparameters. For other configurations, pass a path to a config file through the --config-yml argument, and/or a set of key-value pairs through the --config-override argument. For example:
python scripts/train.py \
--config-yml configs/updown_nocaps_val.yaml \
--config-override OPTIM.BATCH_SIZE 250 \
--gpu-ids 0 --serialization-dir checkpoints/updown-baseline
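For reference, the override mechanism amounts to loading the YAML config and overwriting nested keys given as dotted KEY VALUE pairs. The snippet below is only a sketch of that pattern (assuming PyYAML is available); the actual configuration logic lives in updown/config.py and may differ:

import yaml

def load_config(config_path, overrides=()):
    """Load a YAML config and apply (dotted_key, value) override pairs,
    e.g. ("OPTIM.BATCH_SIZE", 250). Illustrative sketch only; see
    updown/config.py for the real implementation."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    pairs = iter(overrides)
    for key, value in zip(pairs, pairs):
        node = config
        *parents, leaf = key.split(".")
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return config

# Example mirroring the command above:
config = load_config("configs/updown_nocaps_val.yaml", ("OPTIM.BATCH_SIZE", 250))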
Multi-GPU training is fully supported; pass GPU IDs as --gpu-ids 0 1 2 3.
This script serializes model checkpoints every few iterations and keeps track of the best performing checkpoint based on overall CIDEr score. Refer to updown/utils/checkpointing.py for more details on how checkpointing is managed. A copy of the configuration file used for a particular experiment is also saved under --serialization-dir.
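The checkpointing pattern described here (periodic saves plus a separately tracked best-by-CIDEr copy) can be summarized by the minimal sketch below; the class and file names are illustrative, and the actual logic lives in updown/utils/checkpointing.py:

import os
import torch

class SimpleCheckpointManager:
    """Minimal sketch: periodically save model/optimizer state and keep a
    separate copy of the best checkpoint according to a scalar metric
    (overall CIDEr here). Illustrative only; see updown/utils/checkpointing.py."""

    def __init__(self, model, optimizer, serialization_dir):
        self.model = model
        self.optimizer = optimizer
        self.serialization_dir = serialization_dir
        self.best_metric = float("-inf")

    def step(self, iteration, metric):
        state = {
            "iteration": iteration,
            "model": self.model.state_dict(),
            "optimizer": self.optimizer.state_dict(),
        }
        torch.save(state, os.path.join(self.serialization_dir, "checkpoint_{}.pth".format(iteration)))
        if metric > self.best_metric:
            self.best_metric = metric
            torch.save(state, os.path.join(self.serialization_dir, "checkpoint_best.pth"))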
This script logs loss curves and metrics to Tensorboard; log files are written under --serialization-dir. Execute tensorboard --logdir /path/to/serialization_dir --port 8008 and visit localhost:8008 in the browser.
Generate predictions for nocaps val or nocaps test using a pretrained checkpoint:
python scripts/inference.py \
--config-yml /path/to/config.yaml \
--checkpoint-path /path/to/checkpoint.pth \
--output-path /path/to/save/predictions.json \
--gpu-ids 0
Add the --evalai-submit flag if you wish to submit the predictions directly to EvalAI and get results.
Pre-trained checkpoint with the provided config is available to download here:
- Checkpoint (.pth file): https://bit.ly/2JwuHcP
- Predictions on nocaps val: https://bit.ly/2YKxxBA
- Predictions on nocaps test: https://bit.ly/2XBs0R4
| split | in-domain CIDEr | in-domain SPICE | near-domain CIDEr | near-domain SPICE | out-of-domain CIDEr | out-of-domain SPICE | overall BLEU1 | overall BLEU4 | overall METEOR | overall ROUGE | overall CIDEr | overall SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| val | 78.1 | 11.6 | 57.7 | 10.3 | 31.3 | 8.3 | 73.7 | 18.3 | 22.7 | 50.4 | 55.3 | 10.1 |
| test | 74.3 | 11.5 | 56.9 | 10.3 | 30.1 | 8.1 | 74.0 | 19.2 | 23.0 | 51.0 | 54.3 | 10.1 |