Microsoft COCO Caption Evaluation

Evaluation code for MS COCO caption generation.

Description

This repository provides Python 3 support for the caption evaluation metrics used for the MS COCO dataset.

The code is derived from the original repository that supports Python 2.7: https://github.com/tylin/coco-caption.
Caption evaluation depends on the COCO API, which natively supports Python 3.

Requirements

Java 1.8.0
Python 3
For CLIPScore, both PyTorch and OpenAI's CLIP are required.
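
The requirements above can be checked with a short script before installing. This is a minimal sketch, not part of the repository: the helper name `check_requirements` is hypothetical, and the import names `torch` and `clip` are assumptions about how PyTorch and OpenAI's CLIP are installed. It only checks that a `java` executable is on the PATH and does not verify the Java version.

```python
import importlib.util
import shutil
import sys


def check_requirements():
    """Return a dict mapping each requirement to whether it appears available."""
    return {
        "python3": sys.version_info.major >= 3,
        # Only checks that a `java` executable exists, not that it is 1.8.0.
        "java": shutil.which("java") is not None,
        # Needed for CLIPScore only; `torch` and `clip` are assumed import names.
        "torch": importlib.util.find_spec("torch") is not None,
        "clip": importlib.util.find_spec("clip") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```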

Installation

To install pycocoevalcap and the pycocotools dependency (https://github.com/cocodataset/cocoapi), run:

pip install git+https://github.com/jmhessel/pycocoevalcap

Usage

See the example script: example/coco_eval_example.py
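
For orientation, a typical evaluation might look like the sketch below. It is an illustration rather than a copy of the example script: the helper names `write_results` and `evaluate` are hypothetical, and the file paths are supplied by the caller. The results file is assumed to follow the COCO caption results format, a JSON list of objects with `image_id` and `caption` fields.

```python
import json


def write_results(captions, path):
    """Write a {image_id: caption} mapping as a COCO-style caption results file."""
    records = [{"image_id": int(i), "caption": str(c)} for i, c in captions.items()]
    with open(path, "w") as f:
        json.dump(records, f)
    return records


def evaluate(annotation_file, results_file):
    """Score a results file against COCO caption annotations."""
    # Imported lazily so write_results works without the packages installed.
    from pycocotools.coco import COCO
    from pycocoevalcap.eval import COCOEvalCap

    coco = COCO(annotation_file)
    coco_result = coco.loadRes(results_file)
    coco_eval = COCOEvalCap(coco, coco_result)
    # Restrict evaluation to the images present in the results file.
    coco_eval.params["image_id"] = coco_result.getImgIds()
    coco_eval.evaluate()
    # Maps each metric name to its corpus-level score.
    return dict(coco_eval.eval)
```

Calling `evaluate` requires Java and the installed packages, and the SPICE step may download Stanford CoreNLP on first use.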

Files

eval.py: Contains the COCOEvalCap class, which can be used to evaluate results on COCO.
tokenizer: Python wrapper of Stanford CoreNLP PTBTokenizer
bleu: Bleu evaluation code
meteor: Meteor evaluation code
rouge: Rouge-L evaluation code
cider: CIDEr evaluation code
spice: SPICE evaluation code
clipscore: CLIPScore evaluation code

Setup

SPICE requires the download of Stanford CoreNLP 3.6.0 code and models. This will be done automatically the first time the SPICE evaluation is performed.
Note: SPICE will try to create a cache of parsed sentences in ./spice/cache/. This dramatically speeds up repeated evaluations. The cache directory can be moved by setting 'CACHE_DIR' in ./spice. In the same file, caching can be turned off by removing the '-cache' argument to 'spice_cmd'.
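
The effect of removing the '-cache' argument can be sketched on a plain argument list. This is a hypothetical illustration, not the repository's code: the helper `without_cache` and the example command are made up, and the sketch assumes that '-cache' is immediately followed by the cache directory in the command.

```python
def without_cache(cmd):
    """Return a copy of cmd with '-cache' and the value after it removed."""
    result = []
    skip_next = False
    for arg in cmd:
        if skip_next:
            # Drop the cache directory that follows '-cache'.
            skip_next = False
            continue
        if arg == "-cache":
            skip_next = True
            continue
        result.append(arg)
    return result


# Illustrative command only; the real 'spice_cmd' may contain different arguments.
example_cmd = ["java", "-jar", "spice.jar", "input.json",
               "-cache", "./spice/cache/", "-out", "output.json"]
print(without_cache(example_cmd))
```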

References

Microsoft COCO Captions: Data Collection and Evaluation Server
PTBTokenizer: We use the Stanford Tokenizer which is included in Stanford CoreNLP 3.4.1.
BLEU: BLEU: a Method for Automatic Evaluation of Machine Translation
Meteor: Project page with related publications. We use the latest version (1.5) of the code. Changes have been made to the source code to properly aggregate the statistics for the entire corpus.
Rouge-L: ROUGE: A Package for Automatic Evaluation of Summaries
CIDEr: CIDEr: Consensus-based Image Description Evaluation
SPICE: SPICE: Semantic Propositional Image Caption Evaluation
CLIPScore: CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Developers

Xinlei Chen (CMU)
Hao Fang (University of Washington)
Tsung-Yi Lin (Cornell)
Ramakrishna Vedantam (Virginia Tech)

Acknowledgement

David Chiang (University of Notre Dame)
Michael Denkowski (CMU)
Alexander Rush (Harvard University)
Jungo Kasai (UW): for helping to squash a bug with the CLIPScore implementation
