Conceptual Captions is a dataset of (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.
See http://ai.google.com/research/ConceptualCaptions for details.
Automatic image captioning is the task of producing a natural-language utterance (usually a sentence) that correctly reflects the visual content of an image. Until now, the resource most commonly used for this task has been the MS-COCO dataset, containing around 120,000 images with 5-way image-caption annotations (produced by paid annotators).
Google's Conceptual Captions dataset has more than 3 million images, paired with natural-language captions. In contrast with the curated style of the MS-COCO images, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. The raw descriptions are harvested from the Alt-text HTML attribute associated with web images. We developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.
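The actual extraction pipeline is not part of this release; as a toy illustration of the extract-and-filter idea described above, here is a minimal sketch using only the Python standard library (the filter thresholds and the sample HTML are invented for illustration):

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect (image URL, alt-text) candidate pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt"):
                self.candidates.append((a["src"], a["alt"]))


def keep(caption, min_tokens=3, max_tokens=20):
    """Toy filter: keep captions of moderate length (thresholds are invented)."""
    tokens = caption.lower().split()
    return min_tokens <= len(tokens) <= max_tokens


parser = AltTextExtractor()
parser.feed('<img src="http://example.com/dog.jpg" alt="a dog runs on the beach">')
pairs = [(url, alt) for url, alt in parser.candidates if keep(alt)]
print(pairs)  # [('http://example.com/dog.jpg', 'a dog runs on the beach')]
```

The released pipeline additionally hypernymizes proper names and applies many more cleanliness and fluency filters than this sketch shows.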
More details are available in this paper (please cite the paper if you use or discuss this dataset in your work):
@inproceedings{sharma2018conceptual,
  title = {Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning},
  author = {Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu},
  booktitle = {Proceedings of ACL},
  year = {2018},
}
The Conceptual Captions dataset release contains two splits: train (~3.3M examples) and validation (~16K examples). See Table 1 below for more details.
Table 1: Dataset stats.
| Split | Examples | Unique Tokens | Tokens per Caption: Mean | StdDev | Median |
|---|---|---|---|---|---|
| Train | 3,318,333 | 51,201 | 10.3 | 4.5 | 9.0 |
| Valid | 15,840 | 10,900 | 10.4 | 4.7 | 9.0 |
| Test (Hidden) | 12,559 | 9,645 | 10.2 | 4.6 | 9.0 |
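The per-caption token statistics reported in Table 1 can be computed for any caption file with the standard library; a minimal sketch on invented sample captions (reproducing the actual numbers requires the released TSV files):

```python
import statistics

# Hypothetical sample captions; real data comes from the released TSV files.
captions = [
    "a dog runs on the beach",
    "sunset over the mountains in autumn",
    "person riding a bicycle down a city street",
]

# Captions are already tokenized and lowercased, so whitespace splitting suffices.
lengths = [len(c.split()) for c in captions]
unique_tokens = {tok for c in captions for tok in c.split()}

print("examples:", len(captions))
print("unique tokens:", len(unique_tokens))
print("mean:", round(statistics.mean(lengths), 1))
print("stddev:", round(statistics.stdev(lengths), 1))
print("median:", statistics.median(lengths))
```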
Hidden Test set
We are not releasing the official test split (~12.5K examples). Instead, we are hosting a competition (see http://ai.google.com/research/ConceptualCaptions) dedicated to supporting submissions and evaluations of model outputs on this blind test set.
We believe this setup has several advantages:
- it allows evaluation on an unbiased, large set of images;
- it keeps the test set completely blind, eliminating concerns about overfitting to the test set, cheating, etc.;
- it provides a clean setup for advancing the state of the art on this task, including reporting reproducible results in paper publications.
The released data is provided as TSV (tab-separated values) text files with the following columns:
Table 2: Columns in TSV files.
| Column | Description |
|---|---|
| 1 | Caption. The text has been tokenized and lowercased. |
| 2 | Image URL |
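A minimal sketch of parsing this two-column format (the inline sample data is invented; substitute the downloaded split file for the `StringIO` buffer):

```python
import csv
import io

# Invented sample in the released format: caption <TAB> image URL, one pair per line.
sample = (
    "a dog runs on the beach\thttp://example.com/dog.jpg\n"
    "sunset over the mountains\thttp://example.com/sun.jpg\n"
)

# For a real split, replace io.StringIO(sample) with open("<split>.tsv").
reader = csv.reader(io.StringIO(sample), delimiter="\t")
pairs = [(caption, url) for caption, url in reader]
print(pairs[0])  # ('a dog runs on the beach', 'http://example.com/dog.jpg')
```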
If you have a technical question regarding the dataset, code, or publication, please create an issue in this repository. This is the fastest way to reach us.
If you would like to share feedback or report concerns, please email us at conceptual-captions@google.com