Code used by the paper "Pre-gen metrics: Predicting caption quality metrics without generating captions" (to be published).
Image caption generators are typically evaluated by using them to generate captions for the images in a test set and then comparing the generated captions to the reference sentences provided with that test set. The functions that measure the similarity between generated and reference sentences are called post-gen(eration) metrics; these include METEOR, CIDEr, and SPICE. This paper investigates different ways to compute pre-gen(eration) metrics, which evaluate image caption generators without needing to generate any sentences, for example language model perplexity. Avoiding the generation step allows the evaluation process to finish in half a minute instead of twenty minutes. We find that the best pre-gen metric has a correlation coefficient with CIDEr of R^2 = 0.937.
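Language model perplexity, mentioned above as an example pre-gen metric, can be computed directly from the per-token log-probabilities a caption generator assigns to the reference sentences. The following is a minimal sketch of that calculation, not the implementation used in this repository:

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token sequence, given natural-log token probabilities.

    Perplexity is exp of the negative mean log-probability; lower values
    mean the model finds the reference captions less surprising.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# A model assigning uniform probability 1/4 to each of 3 tokens
# has perplexity exactly 4.
lp = [math.log(0.25)] * 3
print(perplexity(lp))  # 4.0
```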
Works on Python 3.

Python dependencies (install all with `pip`):

- tensorflow (v1.4)
- future
- numpy
- scipy
- h5py
- pandas
- Download Karpathy's Flickr8k, Flickr30k, and MSCOCO datasets (including image features).
- Download this fork of the MSCOCO Evaluation toolkit.
- Download the contents of the where-image2 results folder into `model_data`. Trained models are not available for immediate download but can be generated by running the experiments from scratch.
- Open `config.py`.
- Set `debug` to True or False (True is used to run a quick test).
- Set `raw_data_dir` to return the directory to the Karpathy datasets (`dataset_name` is 'flickr8k', 'flickr30k', or 'mscoco').
- Set `mscoco_dir` to the directory to the MSCOCO Evaluation toolkit.
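The settings described above might look something like the sketch below. The names mirror the README (`debug`, `raw_data_dir`, `mscoco_dir`), but the paths are placeholders and the real `config.py` in the repository may be structured differently; since `raw_data_dir` is described as returning a directory per dataset, it is sketched here as a function:

```python
# Hypothetical sketch of config.py; the actual file in the repository
# is authoritative and may differ.

debug = True  # True runs a quick test; set to False for the full experiments

def raw_data_dir(dataset_name):
    # dataset_name is 'flickr8k', 'flickr30k', or 'mscoco'
    return '/path/to/karpathy_datasets/' + dataset_name

# Directory containing the MSCOCO Evaluation toolkit fork
mscoco_dir = '/path/to/mscoco_eval_toolkit'
```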
File name | Description
---|---
`results` | Folder storing the results of each pre-gen metric. The predicted scores for all pre- and post-gen metrics are in `results/data.txt`. The correlations between each pre-gen metric and each post-gen metric are in `results/correlations_*.txt`, and the timings of pre- and post-gen metrics are recorded in `results/timing.txt`.
`best_pregen_example_code.py` (example) | A demonstration of how to calculate the pre-gen metric that was found to be the best in these experiments.
`correlation_phase1.py` (main) | Used to cache the predicted probabilities of each caption generator being evaluated with the pre- and post-gen metrics. Probabilities are stored in `test_probs`.
`correlation_phase2.py` (main) | Used to generate the pre- and post-gen metric scores for each caption generator being evaluated. Scores are stored in `results/data.txt`.
`correlation_phase3.py` (main) | Used to measure the correlation coefficient of every pre-gen metric against every post-gen metric. Results are stored in `results/correlations_*.txt`.
`get_timings.py` (main) | Used to measure how long pre- and post-gen metrics take to complete.
`results.xlsx` (processed data) | MS Excel spreadsheet with the results.
Other files are copied from where-image2.