AASC: A Perl repository from umnlp

AASC: ACL Anthology Sentence Corpus

AASC is a corpus of natural language text extracted from scientific papers. It contains 2,339,195 sentences from PDF-format papers from the ACL Anthology [1], a comprehensive scientific paper repository on computational linguistics and natural language processing.

For PDF document analysis, we use PDFNLT 1.0 [2], a PDF paper analysis tool specifically trained for the ACL Anthology. After excluding papers with non-standard structures (e.g., no abstract or no references), the remaining 13,923 papers were further processed by (1) sentence splitting and (2) section type labeling.

The ACL_2018_v2.tar.gz file contains the extracted natural language sentences for each <paper_ID>, where the <paper_ID> is the unique identifier of the paper on the ACL Anthology. The corresponding PDF version can be found using the URL: http://aclweb.org/anthology/<paper_ID>.
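
The mapping between a sentence file and its PDF can be sketched in a few lines of Python. This is only an illustration of the URL pattern given above; the helper names `paper_id_from_filename` and `paper_url` are hypothetical and are not part of AASC.

```python
from pathlib import Path

# Base URL taken from the pattern http://aclweb.org/anthology/<paper_ID>.
ANTHOLOGY_BASE = "http://aclweb.org/anthology/"


def paper_id_from_filename(filename: str) -> str:
    """Return the paper ID for a sentence file, e.g. 'A00-1001.ss' -> 'A00-1001'."""
    # Path.stem drops the directory part and the final '.ss' extension.
    return Path(filename).stem


def paper_url(paper_id: str) -> str:
    """Return the ACL Anthology URL for the given paper ID."""
    return ANTHOLOGY_BASE + paper_id.strip()


if __name__ == "__main__":
    print(paper_url(paper_id_from_filename("A00-1001.ss")))
```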

Each sentence file is named <paper_ID>.ss. Each line in the file holds one sentence as three tab-separated values:

Column	Example (A00-1001.ss)
Sentence ID	`s-1-1-0-0`
Section type	`abstract`
Sentence text	`The paper describes a natural language based expert system route advisor for the public bus transport in Trondheim, Norway.`
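
As a minimal sketch of how a sentence file might be read, the following Python splits each line into the three columns listed above. The names `Sentence`, `parse_ss_lines`, and `read_ss_file` are hypothetical, and UTF-8 encoding is an assumption that the description above does not confirm.

```python
from typing import Iterable, Iterator, NamedTuple


class Sentence(NamedTuple):
    sentence_id: str   # e.g. "s-1-1-0-0"
    section_type: str  # e.g. "abstract"
    text: str          # the sentence itself


def parse_ss_lines(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one Sentence per well-formed, tab-separated line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first two tabs only, so a tab inside the
        # sentence text (if any) stays part of the text column.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip lines that do not have three columns
        yield Sentence(*parts)


def read_ss_file(path: str) -> Iterator[Sentence]:
    """Read a <paper_ID>.ss file; UTF-8 is assumed, not confirmed."""
    with open(path, encoding="utf-8") as handle:
        yield from parse_ss_lines(handle)
```

For example, `list(read_ss_file("A00-1001.ss"))` would return one `Sentence` per line of that file, assuming it has been extracted from the archive into the current directory.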

A simple dictionary-based classifier was used for the section type labeling.
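
To illustrate what a dictionary-based section labeler can look like in general, here is a small Python sketch. It is not the classifier used to build AASC: the keyword dictionary, the label names other than `abstract`, the `other` fallback, and the function `label_section` are all assumptions made for the example.

```python
# Hypothetical keyword dictionary: label -> substrings looked up in a heading.
# Only "abstract" is a label known from the example above; the rest are
# illustrative placeholders.
SECTION_KEYWORDS = {
    "abstract": ("abstract",),
    "introduction": ("introduction",),
    "related_work": ("related work", "previous work"),
    "conclusion": ("conclusion",),
    "references": ("references", "bibliography"),
}


def label_section(heading: str, default: str = "other") -> str:
    """Return the first label whose keyword occurs in the heading."""
    normalized = heading.lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in normalized for keyword in keywords):
            return label
    return default  # no dictionary entry matched


if __name__ == "__main__":
    for heading in ("Abstract", "1 Introduction", "5 Conclusions", "Appendix A"):
        print(heading, "->", label_section(heading))
```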

For details, see our Overview of AASC.

Following the copyright policy of the original ACL Anthology, AASC materials are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License.
This work was supported by the National Institute of Informatics and JST CREST JPMJCR1513.
