/aquamuse

AQuaMuSe is a novel, scalable approach to automatically mining query-based multi-document summarization datasets, for both extractive and abstractive summaries, from a question answering dataset (Google Natural Questions) and a large document corpus (Common Crawl).

Dataset for Query-based Multi-Document Summarization

This repository contains versions of the automatically generated datasets for abstractive and extractive query-based multi-document summarization, as described in the AQuaMuSe paper.

High-level Notes:

  • Dependencies: Document URLs reference the Common Crawl June 2017 archive.
  • Data Format:
    • Directory structure:
      • Each dataset release will have two top-level folders: abstractive and extractive.
      • Each top-level folder contains three sub-folders for train, dev and test examples.
    • File format: TFRecord.
    • Fields:
      • query: input query to be used as summarization context. This is a single valued byte_list feature, derived from Natural Questions user queries.
      • input_urls: List of URLs of the input documents to be summarized, pointing to Common Crawl. URLs are separated by the special token <EOD>.
      • target: Summarization target, derived from Natural Questions long answers.
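
As a rough illustration of the fields above, the sketch below decodes the three raw values of one record and splits input_urls on the <EOD> separator. It is not part of the release: the Example class, the decode_example helper, and the sample values are hypothetical, and it assumes that each field arrives as a single UTF-8 byte string (the notes above only state this for query). Reading the TFRecord files themselves is left out and would be done with TensorFlow.

```python
from dataclasses import dataclass
from typing import List

# Separator token named in the input_urls field description.
EOD = "<EOD>"


@dataclass
class Example:
    """Hypothetical container for one decoded record."""
    query: str
    input_urls: List[str]
    target: str


def decode_example(query: bytes, input_urls: bytes, target: bytes) -> Example:
    """Decode raw byte values into strings and split the URL list.

    Assumption: all three fields are single UTF-8 byte strings.
    """
    parts = [u.strip() for u in input_urls.decode("utf-8").split(EOD)]
    return Example(
        query=query.decode("utf-8"),
        # Drop empty pieces, e.g. from a trailing separator.
        input_urls=[u for u in parts if u],
        target=target.decode("utf-8"),
    )


# Made-up values for illustration only; not taken from the dataset.
example = decode_example(
    b"example query",
    b"http://example.com/a<EOD>http://example.com/b<EOD>",
    b"example target summary",
)
print(example.input_urls)
```

Stripping whitespace and dropping empty pieces is a defensive choice for the case where the separator is padded or trails the last URL; whether the released files actually do either is not stated here.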

Disclaimer

This is not an official Google product.