This repository contains versions of automatically generated datasets for abstractive and extractive query-based multi-document summarization, as described in the AQuaMuSe paper.
High-level Notes:
- Dependencies: Document URLs reference the Common Crawl June 2017 archive.
- Data Format:
  - Directory structure:
    - Each dataset release will have two top-level folders: `abstractive` and `extractive`.
    - Each top-level folder contains three sub-folders for `train`, `dev` and `test` examples.
  - File format: TFRecord.
  - Fields:
    - `query`: Input query to be used as summarization context. This is a single-valued `byte_list` feature, derived from Natural Questions user queries.
    - `input_urls`: List of URLs, pointing to Common Crawl, of the input documents to be summarized. URLs are separated by the special token `<EOD>`.
    - `target`: Summarization target, derived from Natural Questions long answers.

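The layout and fields described above can be sketched in Python as follows. This is a minimal, unofficial sketch: the helper names (`split_dirs`, `split_urls`, `read_examples`) are hypothetical, the file names inside each split folder are not specified here so the caller passes them in, and the feature spec assumes that `input_urls` and `target` are each stored as a single bytes value like `query`.

```python
from pathlib import Path

EOD = "<EOD>"  # separator token between URLs, per the Fields list
TASKS = ("abstractive", "extractive")
SPLITS = ("train", "dev", "test")


def split_dirs(release_root):
    """Return the six task/split directories expected under one release."""
    root = Path(release_root)
    return [root / task / split for task in TASKS for split in SPLITS]


def split_urls(input_urls):
    """Split a raw input_urls value (str or bytes) into individual URLs."""
    if isinstance(input_urls, bytes):
        input_urls = input_urls.decode("utf-8")
    return [url.strip() for url in input_urls.split(EOD) if url.strip()]


def read_examples(record_paths):
    """Yield (query, urls, target) tuples from TFRecord files.

    TensorFlow is imported lazily so the helpers above work without it.
    The feature spec below is an assumption: one bytes value per field.
    """
    import tensorflow as tf

    spec = {
        "query": tf.io.FixedLenFeature([], tf.string),
        "input_urls": tf.io.FixedLenFeature([], tf.string),
        "target": tf.io.FixedLenFeature([], tf.string),
    }
    dataset = tf.data.TFRecordDataset([str(p) for p in record_paths])
    for raw in dataset:
        example = tf.io.parse_single_example(raw, spec)
        yield (
            example["query"].numpy().decode("utf-8"),
            split_urls(example["input_urls"].numpy()),
            example["target"].numpy().decode("utf-8"),
        )
```

If parsing fails because a field holds more than one value, swapping the corresponding entry for `tf.io.VarLenFeature(tf.string)` may be the appropriate adjustment. Note that the URLs only identify documents; the document text itself has to be fetched from the Common Crawl archive separately.
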
This is not an official Google product.