Python version: 3.9.2
- Run:
  ```bash
  bash install.sh
  ```
- Download the pacscore `clip_ViT-B-32.pth` fine-tuned checkpoint from the pacscore repo and place it under `pacscore/checkpoints`.
To use the ensemble evaluation method with our proposed weights, use:

```python
from metrics.ensembeval_score import compute_ensembeval_score

scores = compute_ensembeval_score(candidates, references, image_paths)
```
Here, `candidates` is a list of captions, `references` is a list of lists of reference captions, and `image_paths` is a list of strings with the locations of the images. For example:
```python
import json

from metrics.ensembeval_score import compute_ensembeval_score

with open('example/references.json', 'r') as fp:
    ref_dict = json.load(fp)
with open('example/candidates.json', 'r') as fp:
    cand_dict = json.load(fp)

correct_candidates = [cand_dict['im1'], cand_dict['im2']]
references = [ref_dict['im1'], ref_dict['im2']]
image_paths = ['example/im1.jpg', 'example/im2.jpg']
scores = compute_ensembeval_score(correct_candidates, references, image_paths)  # 0.9107, 0.9712

incorrect_candidates = [cand_dict['im2'], cand_dict['im1']]
scores = compute_ensembeval_score(incorrect_candidates, references, image_paths)  # 0.2022, 0.2118
```
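The expected input shapes can also be built inline. Below is a minimal sketch with made-up captions (the caption strings are hypothetical; the image paths reuse the bundled examples):

```python
from metrics.ensembeval_score import compute_ensembeval_score

# Hypothetical captions, for illustration only.
candidates = ['a dog runs along the beach', 'two children play football']  # one caption per image
references = [                                                             # one list of references per image
    ['a dog running on the shore', 'a brown dog at the beach'],
    ['kids kicking a ball around', 'two boys playing soccer'],
]
image_paths = ['example/im1.jpg', 'example/im2.jpg']

scores = compute_ensembeval_score(candidates, references, image_paths)  # one score per candidate
```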
To use your own weights, provide a mapping from metric to weight via the `weights` argument:

```python
scores = compute_ensembeval_score(candidates, references, image_paths, weights=your_weights)
```
If none of the metrics in your provided weights uses references or images, you can pass `None` for the corresponding argument.
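For instance, a minimal sketch (the metric names and weight values below are hypothetical; use the metric keys your setup actually defines):

```python
from metrics.ensembeval_score import compute_ensembeval_score

# Hypothetical metric-to-weight mapping, for illustration only.
your_weights = {'pac_score': 0.6, 'cider': 0.4}
scores = compute_ensembeval_score(candidates, references, image_paths, weights=your_weights)

# If no metric in the weights uses references, pass None for that argument
# (and likewise for image_paths when no metric uses images):
image_text_weights = {'pac_score': 1.0}  # hypothetical reference-free metric
scores = compute_ensembeval_score(candidates, None, image_paths, weights=image_text_weights)
```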
To replicate the correlation-with-human-ratings results reported in the paper, first download the relevant data, then use our code to run the experiments.
Each human ratings dataset requires data from existing image captioning datasets. First, download the following image captioning datasets:
- Flickr8k (needed for Flickr8k-Expert, Flickr8k-CF and Composite): Download the dataset from the following link.
- Flickr30k (needed for Composite and the Reformulations dataset): Download the dataset from the following link.
- MSCOCO (needed for Composite, THumB and the Reformulations dataset): Download the 2014 val images from the following link.
- For both MSCOCO and Flickr30k, download the Karpathy split json files (dataset_coco.json and dataset_flickr30k.json) from the following link.
- Pascal (needed for the Pascal50S dataset): Download the 2008 training/validation images from the following link.
- Composite: Download the AMT results from the following link.
- THumB: Clone the THumB repo under the project root:
  ```bash
  git clone https://github.com/jungokasai/THumB.git
  ```
- Pascal50S: Download the consensus ratings from the following link.
- The Reformulations dataset is currently included in this GitHub repository for anonymity.
After downloading all of the above, unzip and untar where necessary, and update the relevant paths in the `config.py` file.
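For illustration, the edits amount to pointing each path at your local copies. The variable names below are hypothetical; use the ones actually defined in `config.py`:

```python
# Hypothetical excerpt of config.py; actual variable names may differ.
flickr8k_dir = '/data/flickr8k'
flickr30k_dir = '/data/flickr30k'
coco_val2014_dir = '/data/mscoco/val2014'
karpathy_coco_json = '/data/karpathy_splits/dataset_coco.json'
karpathy_flickr30k_json = '/data/karpathy_splits/dataset_flickr30k.json'
pascal_voc2008_dir = '/data/pascal_voc2008'
composite_amt_dir = '/data/composite_amt'
```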
To compute the correlation with human ratings, run:

```bash
python run_experiments.py --dataset [flickr8k_expert/flickr_cf/composite/thumb/polaris]
```
To compute the pairwise accuracy, run:

```bash
python run_experiments.py --dataset [pascal50/reformulations] --eval_method pairwise
```
Note: the CLIPImageScore metric is highly time-consuming and is not included in these scripts by default. If you do wish to use it (acknowledging that it will take a long time), add the `--clip_image_score` flag:

```bash
python run_experiments.py --dataset [flickr8k_expert/flickr_cf/composite/thumb/polaris] --clip_image_score
python run_experiments.py --dataset [pascal50/reformulations] --eval_method pairwise --clip_image_score
```
To replicate the metric usage analysis and plots from the paper, run:

```bash
python metric_usage_analysis.py
```

This will generate Figures 2, 3, and 5 from the paper and dump them in the project directory.