/deft_corpus

The Definition Extraction From Text corpus and relevant formatting scripts


Welcome to the DEFT corpus!

Welcome to the largest expertly annotated corpus for complex definition extraction in free text. Pardon our dust: this data is associated with SemEval 2020 Task 6 (DeftEval), and we are releasing the full dataset on the SemEval conference schedule. Train and dev data are available now; test data will become available after the SemEval evaluation period ends on 2 Feb 2020. You can source the complete text from the corresponding textbooks at https://cnx.org.

The corpus was last updated on 16 JAN 2020.

For more information regarding the annotation, schema, or general characteristics of the corpus, please see our paper, listed under Citation below.

Data Format

We are currently releasing annotated data using a CoNLL 2003-like format with the following structure:

TOKEN TXT_SOURCE_FILE START_CHAR END_CHAR TAG TAG_ID ROOT_ID RELATION

Character indices are derived from the brat standoff format. Tags follow a BIO format with the tag schema outlined in the paper.
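As a rough illustration of the eight-column layout above, here is a minimal Python sketch that reads lines in this format and groups BIO tags into spans. It is not one of the repository's formatting scripts. It assumes that columns are tab-separated and that blank lines separate sentences; neither is stated above, so check the released files before relying on it. The names DeftToken, parse_line, read_sentences, and bio_spans are hypothetical helpers, and the sample rows (including the tag names, IDs, and relation label) are invented for illustration only.

```python
from typing import Iterable, List, NamedTuple, Tuple


class DeftToken(NamedTuple):
    """One row of the CoNLL-like format (hypothetical helper type)."""
    token: str
    source_file: str
    start_char: int
    end_char: int
    tag: str
    tag_id: str
    root_id: str
    relation: str


def parse_line(line: str, sep: str = "\t") -> DeftToken:
    """Split one row into its eight columns (tab separator is an assumption)."""
    fields = [f.strip() for f in line.rstrip("\n").split(sep)]
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    token, source_file, start, end, tag, tag_id, root_id, relation = fields
    return DeftToken(token, source_file, int(start), int(end),
                     tag, tag_id, root_id, relation)


def read_sentences(lines: Iterable[str], sep: str = "\t") -> List[List[DeftToken]]:
    """Group rows into sentences, assuming blank lines act as separators."""
    sentences: List[List[DeftToken]] = []
    current: List[DeftToken] = []
    for line in lines:
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(parse_line(line, sep))
    if current:
        sentences.append(current)
    return sentences


def bio_spans(tokens: Iterable[DeftToken]) -> List[Tuple[str, List[str]]]:
    """Collect B-/I- tagged tokens into (label, [token, ...]) spans."""
    spans: List[Tuple[str, List[str]]] = []
    label, words = None, []
    for tok in tokens:
        if tok.tag.startswith("B-"):
            if words:
                spans.append((label, words))
            label, words = tok.tag[2:], [tok.token]
        elif tok.tag.startswith("I-") and words and tok.tag[2:] == label:
            words.append(tok.token)
        else:
            # "O" or an I- tag that does not continue the open span
            if words:
                spans.append((label, words))
            label, words = None, []
    if words:
        spans.append((label, words))
    return spans


# Invented sample rows; all values here are illustrative, not corpus data.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("Osmosis", "sample.txt", "0", "7", "B-Term", "T1", "-1", "0"),
    ("is", "sample.txt", "8", "10", "O", "-1", "-1", "0"),
    ("water", "sample.txt", "11", "16", "B-Definition", "T2", "T1", "Direct-Defines"),
    ("diffusion", "sample.txt", "17", "26", "I-Definition", "T2", "T1", "Direct-Defines"),
])

sentences = read_sentences(SAMPLE.splitlines())
spans = bio_spans(sentences[0])
```

With the invented sample, spans holds one Term span and one two-token Definition span. To read a real file, pass an open file handle to read_sentences in place of SAMPLE.splitlines().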

DeftEval Results

Results for SemEval 2020 Task 6 - DeftEval are included below:

Subtask 1 Results

Subtask 2 Results

Subtask 3 Results

We will continue to update the official leaderboard as the final evaluation period closes.

Licensing Information

The entire dataset of textbook sentences with annotations is available for use under the CC BY-NC-SA 4.0 license. Contact the authors for information on commercial use.

Acknowledgements

We would like to acknowledge the contributions of the annotation team, without which we would not have a corpus to share. Many thanks to Lucino Chiafullo, Danyi Huang, Micaela Kaplan, Roger LaCroix, Molly Moran, Jennifer Pei-Hsuan Lee, Harper Pollio-Barbee, and Keren Sun for their annotations and contributions.

Citation

If you use the DEFT corpus in your publication, please cite this paper:

@inproceedings{spala-etal-2019-deft,
    title = "{DEFT}: A corpus for definition extraction in free- and semi-structured text",
    author = "Spala, Sasha  and
      Miller, Nicholas A.  and
      Yang, Yiming  and
      Dernoncourt, Franck  and
      Dockhorn, Carl",
    booktitle = "Proceedings of the 13th Linguistic Annotation Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4015",
    pages = "124--131",
    abstract = "Definition extraction has been a popular topic in NLP research for well more than a decade, but has been historically limited to well-defined, structured, and narrow conditions. In reality, natural language is messy, and messy data requires both complex solutions and data that reflects that reality. In this paper, we present a robust English corpus and annotation schema that allows us to explore the less straightforward examples of term-definition structures in free and semi-structured text.",
}