long-summarization: A Python repository from WebisD

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents

This repository contains data and code for the NAACL 2018 paper "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents". Please note that the code is not actively maintained.

Data

Two datasets of long and structured documents (scientific papers) are provided. The datasets are obtained from ArXiv and PubMed OpenAccess repositories.

Get the datasets

ArXiv dataset: Download
PubMed dataset: Download

The datasets are rather large. You need about 5 GB of disk space to download them and about 15 GB of additional space to extract the files. Each tar file contains 4 files: train.txt, val.txt, and test.txt correspond to the training, validation, and test sets, respectively, and the vocab file is a plain-text file containing the vocabulary.

Format of the data

The files are in jsonlines format, where each line is a JSON object corresponding to one scientific paper from ArXiv or PubMed. The abstract, sections, and body are all sentence-tokenized. The JSON objects have the following format:

{
  'article_id': str,
  'abstract_text': List[str],
  'article_text': List[str],
  'section_names': List[str],
  'sections': List[List[str]]
}

{
  'article_id': str,
  'methodology': List[str],
  'method_summary': List[str]
}
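
Since each line of a data file is one JSON object, a file can be streamed paper by paper without loading it fully into memory. The following is a minimal sketch, assuming records with the article_id, section_names, and sections fields shown above; the helper names read_jsonlines and section_lengths are hypothetical and not part of this repository.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def read_jsonlines(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def section_lengths(paper: Dict[str, Any]) -> List[Tuple[str, int]]:
    """Pair each section name with the number of sentences in that section.

    Assumes 'section_names' and 'sections' are aligned by position.
    """
    return [
        (name, len(sentences))
        for name, sentences in zip(paper["section_names"], paper["sections"])
    ]
```

For example, `for paper in read_jsonlines("val.txt"): print(paper["article_id"], section_lengths(paper))` would print the section layout of each paper, provided val.txt has been extracted into the working directory.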

Dataset Statistics

Dataset	# docs	avg. doc. length (words)	avg. summary length (words)
CNN	92K	656	43
Daily Mail	219K	693	52
NY Times	655K	530	38
Gigaword	4M	31	8
Pubmed	133K	3016	203
arXiv	215K	4938	220
arXiv-Methodology (this work)	1006	1358	75
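
Statistics of this kind can be approximated directly from the sentence-tokenized fields. The sketch below assumes that words are whitespace-separated tokens and that article_text and abstract_text hold the document and its summary; the tokenization behind the table may differ, so the numbers it produces will not necessarily match those above. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def count_words(sentences: List[str]) -> int:
    """Count whitespace-separated tokens across a list of sentences."""
    return sum(len(sentence.split()) for sentence in sentences)


def average_lengths(papers: Iterable[Dict[str, List[str]]]) -> Dict[str, float]:
    """Return the document count and mean document/summary length in words."""
    n_docs = 0
    doc_words = 0
    summary_words = 0
    for paper in papers:
        n_docs += 1
        doc_words += count_words(paper["article_text"])
        summary_words += count_words(paper["abstract_text"])
    if n_docs == 0:
        raise ValueError("no papers given")
    return {
        "docs": n_docs,
        "avg_doc_len": doc_words / n_docs,
        "avg_summary_len": summary_words / n_docs,
    }
```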

Results

RG-1	RG-2	RG-3	RG-L
24.54	3.08	0.37	19.93
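
Assuming the RG-N columns stand for ROUGE-N, the idea behind the metric can be sketched as clipped n-gram overlap between a generated summary and a reference summary. This is a simplified illustration with whitespace tokenization and no stemming; it is not the official ROUGE script, and it does not cover RG-L. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) based on clipped n-gram overlap."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Scores from this sketch are fractions between 0 and 1, so they would need to be multiplied by 100 to be on the same scale as the table.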

Examples

0804.1964

we here calibrate gaussians in isotropic neighborhood and gaussians , in order to their variability in aluminum in their energies . the dominant model is represented , and the latter , together with the cme : the multi - cme which

1301.7131

in this paper , we present a method to quantify the hamming - central central narrowing ( tf ) scenario of the half - processing chain , which nearest - contacted of m31 the m31 and other produced from the strength distance . in the model we model the numbers uncertainties using the best - field model and the decay of the respective satellite . furthermore , in which the significance of m31 s , the strong dirac cone exists from the coordinate distance and the actual earth torino into account captured with the mimes .

1401.4668

we present the theoretical space for the network rate of the network of cortical trains . we have calculated a two - range of the network of network in a network of inhibition - spike potential , and the sub - ray emitting poisson trains .

TensorFlow Datasets

The dataset is also available on TensorFlow Datasets, which makes it easy to use within TensorFlow or Colab.

Code

The code is based on the pointer-generator network code by See et al. (2017). Refer to their repo for documentation about the structure of the code. You will need Python 3.6 and TensorFlow 1.5 to run the code. The code might run with later versions of TensorFlow, but this has not been tested. Check out the other dependencies in the requirements.txt file.

A small sample of the dataset is already provided in this repo. To run the code with the sample data, unzip the files in the data directory and execute the run script: ./run.sh.

To train the model with the entire dataset, first convert the jsonlines files to binary using the script scripts/json_to_bin.py, then modify the corresponding training data path in the run.sh script.

Citing

If you find this paper or repo useful, please cite:

"A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents"  
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian  
NAACL-HLT 2018

Another relevant reference is the pointer-generator network by See et al. (2017):

"Get to the point: Summarization with pointer-generator networks."  
Abigail See, Peter J. Liu, and Christopher D. Manning.  
ACL (2017).
