Google Research Datasets

Datasets released by Google Research

Mountain View, CA

Pinned Repositories

conceptual-12m
Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.
conceptual-captions
Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine-learned image captioning systems.
Language: Shell
dstc8-schema-guided-dialogue
The Schema-Guided Dialogue Dataset
Language: Python
natural-questions
Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question answering systems.
Language: Python
Objectron
Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point clouds, and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box that describes the object's position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.
Language: Jupyter Notebook
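
The Objectron description says each 3D bounding box encodes position, orientation, and dimensions. As a rough illustration of how those three quantities determine a box, the sketch below computes the eight corner points from a center, a 3x3 rotation matrix, and a size triple. The function names, the argument layout, and the choice of a rotation about the z-axis are assumptions made for this example; they are not taken from the Objectron annotation format or its tooling.

```python
import math
from itertools import product


def rotation_about_z(angle_rad):
    """3x3 rotation matrix for a rotation about the z-axis (hypothetical helper)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def box_corners(center, rotation, size):
    """Return the 8 corners of an oriented box as (x, y, z) tuples.

    center:   box position (x, y, z)
    rotation: 3x3 matrix giving the box orientation
    size:     box dimensions along its own axes (width, height, depth)
    """
    half = [d / 2.0 for d in size]
    corners = []
    # Every combination of signs selects one corner in the box's local frame.
    for signs in product((-1.0, 1.0), repeat=3):
        local = [sign * h for sign, h in zip(signs, half)]
        # Rotate the local corner, then translate it to the box position.
        world = tuple(
            sum(rotation[i][j] * local[j] for j in range(3)) + center[i]
            for i in range(3)
        )
        corners.append(world)
    return corners


corners = box_corners((0.0, 0.0, 0.0), rotation_about_z(0.0), (2.0, 4.0, 6.0))
print(len(corners))
```

An axis-aligned box is the special case where the rotation is the identity matrix. Any other orientation only changes the matrix passed in.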
paws
This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.
Language: Python
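
The PAWS description stresses word order for paraphrase identification. A small sketch of why that matters: two sentences can contain exactly the same words and still mean different things, so a bag-of-words comparison cannot tell them apart. The sentence pair and the helper name below are illustrative assumptions and are not read from the dataset files.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercase whitespace tokens, ignoring their order."""
    return Counter(sentence.lower().split())


# Illustrative pair: identical words, different meaning.
first = "Flights from New York to Florida"
second = "Flights from Florida to New York"

same_bag = bag_of_words(first) == bag_of_words(second)
same_order = first.lower().split() == second.lower().split()

print("same words:", same_bag)
print("same order:", same_order)
```

A model that only looks at word overlap would score this pair as a perfect match, which is the failure mode the dataset is meant to expose.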
ToTTo
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. We hope it can serve as a useful research benchmark for high-precision conditional text generation.
tydiqa
TyDi QA contains 200k human-annotated question-answer pairs in 11 Typologically Diverse languages, written without seeing the answer and without the use of translation, and is designed for the training and evaluation of automatic question answering systems. This repository provides evaluation code and a baseline system for the dataset.
Language: Python
wiki-reading
This repository contains the three WikiReading datasets as used and described in "WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia" (Hewlett et al., ACL 2016; the English WikiReading dataset) and "Byte-level Machine Reading across Morphologically Varied Languages" (Kenter et al., AAAI-18; the Turkish and Russian datasets).
Language: Python
wit
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets' Repositories

google-research-datasets/wiki-atomic-edits
A dataset of atomic Wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains ~43 million edits across 8 languages.
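
To make "insertion or deletion of a contiguous chunk" concrete, here is a minimal sketch that checks whether one sentence can be obtained from another by inserting a single contiguous run of tokens, and returns that run. Whitespace tokenization and both function names are assumptions for illustration; the dataset's own file format and extraction method are not described here.

```python
def atomic_insertion(base, edited):
    """Return the inserted phrase if `edited` equals `base` plus one
    contiguous chunk of tokens, otherwise None."""
    b, e = base.split(), edited.split()
    if len(e) <= len(b):
        return None
    # Length of the longest common prefix of the two token lists.
    p = 0
    while p < len(b) and b[p] == e[p]:
        p += 1
    # The remaining base tokens must match the tail of the edited sentence.
    rest = len(b) - p
    if e[len(e) - rest:] == b[p:]:
        return " ".join(e[p:len(e) - rest])
    return None


def atomic_deletion(base, edited):
    """A deletion is an insertion viewed in the opposite direction."""
    return atomic_insertion(edited, base)


print(atomic_insertion("The cat sat", "The black cat sat"))
print(atomic_deletion("She moved to Paris in 2010", "She moved to Paris"))
```

Edits that change text in more than one place fail the prefix and suffix check and yield None.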
google-research-datasets/noun-verb
This dataset contains naturally-occurring English sentences that feature non-trivial noun-verb ambiguity.
google-research-datasets/dialogue
google-research-datasets/coarse-discourse
A large corpus of discourse annotations and relations on ~10K forum threads.
Language: Python
google-research-datasets/query-wellformedness
25,100 queries from the Paralex corpus (Fader et al., 2013) annotated with human ratings of whether they are well-formed natural language questions.
google-research-datasets/simulated-dialogue
google-research-datasets/wiki-reading
This repository contains the three WikiReading datasets as used and described in "WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia" (Hewlett et al., ACL 2016; the English WikiReading dataset) and "Byte-level Machine Reading across Morphologically Varied Languages" (Kenter et al., AAAI-18; the Turkish and Russian datasets).
Language: Python
google-research-datasets/quest
Question Understanding and Evaluation on StackExchange
google-research-datasets/sentence-compression
Large corpus of uncompressed and compressed sentences from news articles.
google-research-datasets/sense-anaphora
Publicly released data: sense anaphora annotations.
google-research-datasets/wiki-links
Automatically exported from code.google.com/p/wiki-links
google-research-datasets/nyt-salience
Automatically exported from code.google.com/p/nyt-salience
google-research-datasets/relation-extraction-corpus
Automatically exported from code.google.com/p/relation-extraction-corpus