pregen-metrics

Code used by the paper "Pre-gen metrics: Predicting caption quality metrics without generating captions" (to be published).

Image caption generators are typically evaluated by using them to generate captions for the images in a test set and then comparing the generated captions to the reference sentences provided with the test set. The functions that measure the similarity between the generated and reference sentences are called post-gen(eration) metrics; they include METEOR, CIDEr, and SPICE. This paper investigates different ways to compute pre-gen(eration) metrics: metrics that evaluate an image caption generator without needing to generate any sentences, for example language model perplexity. Avoiding the generation step allows the evaluation process to finish in half a minute instead of twenty minutes. We find that the best pre-gen metric achieves a coefficient of determination with CIDEr of R^2 = 0.937.
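As a concrete illustration of the perplexity example mentioned above, here is a minimal sketch of computing language model perplexity from per-token log probabilities. The function and the dummy probabilities are for illustration only and are not the repository's implementation:

```python
import numpy as np

def perplexity(token_log_probs):
    # Perplexity = exp(-mean per-token log probability); lower values mean
    # the model assigns higher probability to the reference captions.
    return float(np.exp(-np.mean(token_log_probs)))

# Dummy per-token probabilities standing in for a caption model's
# softmax outputs on a reference sentence.
log_probs = np.log([0.5, 0.25, 0.125])
print(perplexity(log_probs))
```

Because no captions are decoded, a score like this can be computed in a single forward pass over the reference sentences.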

Works on Python 3.

Dependencies

Python dependencies (install all with pip):

  • tensorflow (v1.4)
  • future
  • numpy
  • scipy
  • h5py
  • pandas

Before running

  1. Download Karpathy's Flickr8k, Flickr30k, and MSCOCO datasets (including image features).
  2. Download this fork of the MSCOCO Evaluation toolkit.
  3. Download the contents of the where-image2 results folder into model_data. Trained models are not available for immediate download but can be generated by running the experiments from scratch.
  4. Open config.py.
  5. Set debug to True or False (True is used to run a quick test).
  6. Set raw_data_dir to return the directory containing the Karpathy datasets (its dataset_name argument is 'flickr8k', 'flickr30k', or 'mscoco').
  7. Set mscoco_dir to the directory containing the MSCOCO Evaluation toolkit.
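The configuration steps above might look like the following sketch of config.py. The paths are placeholders, and raw_data_dir being a function of dataset_name is an assumption based on step 6; check the actual file for the exact names and shapes of these settings:

```python
# Hypothetical sketch of the settings in config.py; all paths are placeholders.

debug = True  # True runs a quick test instead of the full experiments

def raw_data_dir(dataset_name):
    # dataset_name is 'flickr8k', 'flickr30k', or 'mscoco'
    return '/data/karpathy/' + dataset_name

# Directory of the MSCOCO Evaluation toolkit fork
mscoco_dir = '/tools/cococaption'
```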

File descriptions

File name — Description

results — Folder storing the results of each pre-gen metric. The predicted scores for all pre- and post-gen metrics are in results/data.txt, the correlations between each pre-gen metric and each post-gen metric are in results/correlations_*.txt, and the timings of the pre- and post-gen metrics are recorded in results/timing.txt.

best_pregen_example_code.py (example) — Demonstrates how to calculate the pre-gen metric that was found to be the best in these experiments.

correlation_phase1.py (main) — Caches the predicted probabilities of each caption generator being evaluated with the pre- and post-gen metrics. Probabilities are stored in test_probs.

correlation_phase2.py (main) — Generates the pre- and post-gen metric scores for each caption generator being evaluated. Scores are stored in results/data.txt.

correlation_phase3.py (main) — Measures the correlation coefficient of every pre-gen metric against every post-gen metric. Results are stored in results/correlations_*.txt.

get_timings.py (main) — Measures how long the pre- and post-gen metrics take to complete.

results.xlsx (processed data) — MS Excel spreadsheet with the results.
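The correlation measured in phase 3 can be sketched with scipy (one of the listed dependencies). The per-model scores below are invented for illustration and do not come from results/data.txt:

```python
import numpy as np
from scipy import stats

# Made-up per-model scores: a pre-gen metric (e.g. perplexity, lower is
# better) and a post-gen metric (e.g. CIDEr, higher is better).
pregen_scores = np.array([2.1, 1.8, 1.5, 1.2])
postgen_scores = np.array([0.6, 0.7, 0.85, 0.9])

# Pearson correlation between the two metrics; squaring it gives the
# R^2 value of the kind reported in results/correlations_*.txt.
r, _ = stats.pearsonr(pregen_scores, postgen_scores)
print('R^2 =', r ** 2)
```

A strongly negative r (hence a high R^2) is the expected pattern here, since lower perplexity should accompany higher CIDEr.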

Other files are copied from where-image2.