Code used by the paper "Pre-gen metrics: Predicting caption quality metrics without generating captions" (to be published).
Image caption generators are typically evaluated by using them to generate captions for the images in a test set and then comparing the generated captions to the reference sentences provided with that test set. The functions that measure the similarity between generated and reference sentences are called post-gen(eration) metrics; these include METEOR, CIDEr, and SPICE. This paper investigates different ways to compute pre-gen(eration) metrics, which evaluate image caption generators without needing to generate any sentences, for example language model perplexity. Avoiding the generation step allows the evaluation process to finish in half a minute instead of twenty minutes. We find that the best pre-gen metric has a correlation coefficient with CIDEr of R^2 = 0.937.
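Language model perplexity, mentioned above as an example pre-gen metric, can be computed directly from the per-token log-probabilities a caption generator assigns to the reference sentences. The following is a minimal sketch of that calculation, not the implementation used in this repository:

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a token sequence, given natural-log token probabilities.

    Perplexity is exp of the negative mean log-probability; lower values
    mean the model finds the reference captions less surprising.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# A model assigning uniform probability 1/4 to each of 3 tokens
# has perplexity exactly 4.
lp = [math.log(0.25)] * 3
print(perplexity(lp))  # 4.0
```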
Works on Python 3.

Python dependencies (install all with `pip`):

- tensorflow (v1.4)
- future
- numpy
- scipy
- h5py
- pandas
- Download Karpathy's Flickr8k, Flickr30k, and MSCOCO datasets (including image features).
- Download this fork of the MSCOCO Evaluation toolkit.
- Download the contents of the where-image2 results folder into `model_data`. Trained models are not available for immediate download but can be generated by running the experiments from scratch.
- Open `config.py`.
- Set `debug` to True or False (True is used to run a quick test).
- Set `raw_data_dir` to return the directory to the Karpathy datasets (`dataset_name` is 'flickr8k', 'flickr30k', or 'mscoco').
- Set `mscoco_dir` to the directory to the MSCOCO Evaluation toolkit.
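The settings described above might look something like the sketch below. The names mirror the README (`debug`, `raw_data_dir`, `mscoco_dir`), but the paths are placeholders and the real `config.py` in the repository may be structured differently; since `raw_data_dir` is described as returning a directory per dataset, it is sketched here as a function:

```python
# Hypothetical sketch of config.py; the actual file in the repository
# is authoritative and may differ.

debug = True  # True runs a quick test; set to False for the full experiments

def raw_data_dir(dataset_name):
    # dataset_name is 'flickr8k', 'flickr30k', or 'mscoco'
    return '/path/to/karpathy_datasets/' + dataset_name

# Directory containing the MSCOCO Evaluation toolkit fork
mscoco_dir = '/path/to/mscoco_eval_toolkit'
```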
File name | Description
---|---
`results` | Folder storing the results of each pre-gen metric. The predicted scores for all pre- and post-gen metrics are in `results/data.txt`. The correlations between each pre-gen metric and each post-gen metric are in `results/correlations_*.txt`, and the timings of pre- and post-gen metrics are recorded in `results/timing.txt`.
`best_pregen_example_code.py` (example) | A demonstration of how to calculate the pre-gen metric that was found to be the best in these experiments.
`correlation_phase1.py` (main) | Used to cache the predicted probabilities of each caption generator being evaluated with the pre- and post-gen metrics. Probabilities are stored in `test_probs`.
`correlation_phase2.py` (main) | Used to generate the pre- and post-gen metric scores for each caption generator being evaluated. Scores are stored in `results/data.txt`.
`correlation_phase3.py` (main) | Used to measure the correlation coefficient of every pre-gen metric against every post-gen metric. Results are stored in `results/correlations_*.txt`.
`get_timings.py` (main) | Used to measure how long pre- and post-gen metrics take to complete.
`results.xlsx` (processed data) | MS Excel spreadsheet with the results.
Other files are copied from where-image2.