Computer Science Named Entity Recognition in the Open Research Knowledge Graph (CS-NER dataset)

1) About

This work proposes a standardized CS-NER task by defining a set of seven contribution-centric scholarly entities for CS NER viz., research problem , solution , resource , language , tool , method , and dataset .

The main contributions are:

Merges annotations for contribution-centric named entities from related work as the following datasets:
- The dataset proposed in Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers (Gupta & Manning, IJCNLP 2011) is the source for ftd, annotated for both titles and abstracts for the following select entities mapped to our standardized types focus -> solution ; domain -> research problem ; and technique -> method
- The dataset proposed in Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Luan et al., EMNLP 2018) is the source for scierc, annotated for abstracts for the following select entities with mappings task -> research problem
- The dataset proposed in SemEval-2021 Task 11: NLPContributionGraph - Structuring Scholarly NLP Contributions for a Research Knowledge Graph (D’Souza et al., SemEval 2021) is the source for ncg, annotated for both titles and abstracts for research problem
- https://paperswithcode.com/ as the pwc annotated for both titles and abstracts for task -> research problem and method entities.
Additionally, supplies a new annotated dataset for the titles in the ACL anthology in the acl repository where titles are annotated with all seven entities.

2) Dataset Statistics for full dataset

Please note the numbers below reflect the total annotated entities. They do not reflect the unique set of annotated entities.

Titles

train.data

NER	Count
solution	18,924
research problem	15,646
method	8,854
resource	7,346
tool	1,718
language	1,141
dataset	882

dev.data

NER	Count
solution	1,072
research problem	989
method	574
resource	439
tool	93
language	50
dataset	39

test.data

NER	Count
solution	8,316
research problem	4,070
resource	3,226
method	2,768
tool	743
language	499
dataset	228

Abstracts

train-abs.data

NER	Count
method	10,992
research problem	7,485

dev-abs.data

NER	Count
method	719
research problem	603

test-abs.data

NER	Count
method	2,723
research problem	2,100

The remaining repositories have specialized README files with the respective dataset statistics.

3) Citation

If this work inspires your further research, please consider citing our ICADL 2022 proceedings paper.

@InProceedings{10.1007/978-3-031-21756-2_3,
author="D'Souza, Jennifer
and Auer, S{\"o}ren",
editor="Tseng, Yuen-Hsien
and Katsurai, Marie
and Nguyen, Hoa N.",
title="Computer Science Named Entity Recognition in the Open Research Knowledge Graph",
booktitle="From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="35--45",
abstract="Domain-specific named entity recognition (NER) on Computer Science (CS) scholarly articles is an information extraction task that is arguably more challenging for the various annotation aims that can hamper the task and has been less studied than NER in the general domain.  Given that significant progress has been made on NER, we anticipate that scholarly domain-specific NER will receive increasing attention in the years to come. Currently, progress on CS NER -- the focus of this work -- is hampered in part by its recency and the lack of a standardized annotation aims for scientific entities/terms. Directly addressing these issues, this work proposes a standardized task by defining a set of seven contribution-centric scholarly entities for CS NER viz., research problem, solution, resource, language, tool, method, and dataset.",
isbn="978-3-031-21756-2"
}

Preprint

@article{d2022computer,
  title={Computer Science Named Entity Recognition in the Open Research Knowledge Graph},
  author={D'Souza, Jennifer and Auer, S{\"o}ren},
  journal={arXiv preprint arXiv:2203.14579},
  year={2022}
}

4) Additional resources

CS NER Software trained on the dataset in this repository

Codebase: https://gitlab.com/TIBHannover/orkg/nlp/orkg-nlp-experiments/-/tree/master/orkg_cs_ner

Service URL - REST API: https://orkg.org/nlp/api/docs#/annotation/annotates_paper_annotation_csner_post

Service URL - PyPi: https://orkg-nlp-pypi.readthedocs.io/en/latest/services/services.html#cs-ner-computer-science-named-entity-recognition

jd-coderepos/contributions-ner-cs