This directory contains the annotated data from version 2.0 of the Bank of Effects and Causes Stated Explicitly (BECAUSE).[1] This is the exhaustively annotated dataset reported on in Dunietz et al. (2017b), an expanded and improved version of the dataset reported on in Dunietz et al. (2017a). (You can find the data from the 1.0 release in its own branch.)
The corpus is an attempt to capture the enormous variety of constructions used to express cause and effect. Inspired by the principles of Construction Grammar, we annotate any conventional pattern that expresses causation, however complex. BECAUSE thus includes many constructions that are not annotatable in most schemes, and is more comprehensive than previous efforts to capture causal language. This version also includes annotations for seven types of semantic relations that frequently overlap with causality and are sometimes used to express it. For details, see the aforementioned papers.
The list of causal constructions is available in the constructicon used by annotators for this release.
All annotations are in `.ann` files formatted for brat, and we have included the `annotation.conf` and `visual.conf` files for brat in this directory. (A minimal sketch of reading the brat standoff format follows the source list below.)

There are four data subdirectories, each containing data from a different source:
- CongressionalHearings: Three partial documents from the 2014 NLP Unshared Task in PoliInformatics. These documents are freely available, but for ease of processing, some header information was stripped from the text files. We also annotated only portions of these files, not complete transcripts. To allow others to use our annotation offsets, we have included the preprocessed text files alongside the annotations.
- NYT: 59 randomly selected documents from the year 2007 of the Washington section of the New York Times Annotated Corpus (Sandhaus, 2008). Each `.ann` file shares its name with the corresponding raw NYT article. An LDC subscription is required to obtain the raw files. To turn the raw XML into the plain text files that the annotation offsets correspond to, run `extract_nyt.py` on a system with access to `sed`; the script also depends on NLTK. (A rough sketch of the extraction idea appears at the end of this section.)
- PTB: 47 documents randomly selected from sections 2-23 of the Penn Treebank (Marcus et al., 1994). We excluded WSJ documents that were either earnings reports or corporate leadership/structure announcements, as both tended to be merely short lists of names/numbers. Again, we provide offset annotations named to match the raw PTB files, but the raw files require an LDC subscription.
- MASC: 10 newspaper documents (Wall Street Journal and New York Times articles, totalling 547 sentences) and 2 journal documents (82 sentences) from the Manually Annotated Sub-Corpus (MASC; Ide et al., 2010).
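Because all of the annotation files use brat's plain-text standoff format, they are straightforward to inspect programmatically. Below is a minimal sketch of reading the text-bound annotations (`T` lines) from a `.ann` file; it assumes only the generic brat standoff conventions, not anything BECAUSE-specific (the BECAUSE label set itself is defined in `annotation.conf`), and the file name is hypothetical:

```python
from pathlib import Path

def read_spans(ann_path):
    """Yield (id, label, start, end, text) for each text-bound annotation."""
    for line in Path(ann_path).read_text(encoding="utf-8").splitlines():
        if not line.startswith("T"):
            # Skip event (E), attribute (A), and note (#) lines in this sketch.
            continue
        ann_id, info, text = line.split("\t", 2)
        label, offsets = info.split(" ", 1)
        # Discontinuous annotations list several "start end" fragments
        # separated by semicolons; emit one tuple per fragment.
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            yield ann_id, label, int(start), int(end), text

if __name__ == "__main__":
    # Hypothetical file path, for illustration only.
    for span in read_spans("CongressionalHearings/example.ann"):
        print(*span, sep="\t")
```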
The first three document sets are the same documents that were annotated for BECAUSE 1.0.
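For context on what `extract_nyt.py` has to do: in the NYT Annotated Corpus, each article is an NITF XML file whose body text lives in a `<block class="full_text">` element. The sketch below shows the general idea using only the standard library. It is not a substitute for `extract_nyt.py`, since only that script's output (post-processed with `sed`) is guaranteed to match the annotation offsets:

```python
import sys
import xml.etree.ElementTree as ET

def full_text(xml_path):
    """Return the article body of a raw NYT Annotated Corpus NITF file."""
    root = ET.parse(xml_path).getroot()
    block = root.find(".//block[@class='full_text']")
    if block is None:
        return ""
    # Paragraphs are <p> children; join them with blank lines. The real
    # script's whitespace handling may differ, which is why its output
    # defines the authoritative offsets.
    return "\n\n".join((p.text or "").strip() for p in block.findall("p"))

if __name__ == "__main__":
    print(full_text(sys.argv[1]))
```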
[1]: Prior to May 2017, the corpus was referred to with a different capitalization scheme (BECauSE). The official name for the corpus is now BECAUSE, but either format is fine for citation.
Dunietz, Jesse, Lori Levin, and Jaime Carbonell. The BECauSE Corpus 2.0: Annotating Causality and Overlapping Relations. Proceedings of LAW XI – The 11th Linguistic Annotation Workshop (2017b). Association for Computational Linguistics, Valencia, Spain.

Dunietz, Jesse, Lori Levin, and Jaime Carbonell. Automatically Tagging Constructions of Causation and Their Slot-Fillers. Transactions of the Association for Computational Linguistics (2017a): in press.

Dunietz, Jesse, Lori Levin, and Jaime Carbonell. Annotating Causal Language Using Corpus Lexicography of Constructions. Proceedings of LAW IX – The 9th Linguistic Annotation Workshop (2015): 188-196. Association for Computational Linguistics, Denver, USA.

Sandhaus, Evan. The New York Times Annotated Corpus (2008). Linguistic Data Consortium, Philadelphia, USA.

Ide, Nancy, Christiane Fellbaum, Collin Baker, and Rebecca Passonneau. The Manually Annotated Sub-Corpus: A community resource for and by the people. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (2010): 68-73. Association for Computational Linguistics, Uppsala, Sweden.

Marcus, Mitchell, et al. The Penn Treebank: Annotating predicate argument structure. Proceedings of the Workshop on Human Language Technology, HLT '94 (1994): 114-119. Association for Computational Linguistics, Plainsboro, USA.