
This repository hosts the dataset for the paper "Computer Science Named Entity Recognition in the Open Research Knowledge Graph".


Computer Science Named Entity Recognition in the Open Research Knowledge Graph (CS-NER dataset)

1) About

This work proposes a standardized CS-NER task by defining a set of seven contribution-centric scholarly entities for CS NER, viz., research problem, solution, resource, language, tool, method, and dataset.

The main contributions are:

  1. Merges annotations for contribution-centric named entities from related work as the following datasets:

  2. Additionally, supplies a new annotated dataset of ACL Anthology titles, in the acl repository, where titles are annotated with all seven entities.

2) Dataset Statistics for full dataset

Please note that the numbers below count every annotated entity mention. They are not counts of unique entities.
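To illustrate the difference between total and unique counts, here is a minimal sketch. It assumes a CoNLL-style layout with one `token tag` pair per line, BIO tags, and blank lines between sentences; this README does not specify the file format, so please check the data files before relying on it. The function names and the label strings in the sample (e.g. `research_problem`) are hypothetical.

```python
from collections import Counter


def extract_entities(lines):
    """Yield (label, text) pairs from BIO-tagged 'token tag' lines (assumed format)."""
    label, tokens = None, []
    for line in lines:
        parts = line.split()
        # A blank or malformed line ends the current entity, like an "O" tag.
        token, tag = (parts[0], parts[-1]) if len(parts) >= 2 else ("", "O")
        if tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)  # continue the entity that is currently open
            continue
        if label is not None:
            yield label, " ".join(tokens)  # close the open entity
            label, tokens = None, []
        if tag.startswith(("B-", "I-")):
            label, tokens = tag[2:], [token]  # open a new entity
    if label is not None:
        yield label, " ".join(tokens)


def count_entities(lines):
    """Return (total mentions per label, unique entity strings per label)."""
    mentions = list(extract_entities(lines))
    total = Counter(label for label, _ in mentions)
    unique = Counter(label for label, _ in set(mentions))
    return total, unique


# Tiny made-up sample, not taken from the dataset.
sample = """\
BERT B-solution
for O
parsing B-research_problem

BERT B-solution
on O
parsing B-research_problem
of O
tweets B-resource
""".splitlines()

total, unique = count_entities(sample)
print("total: ", dict(total))
print("unique:", dict(unique))
```

In this sample, "BERT" is tagged as a solution twice, so it contributes two to the total count but only one to the unique count. The tables below report the first kind of number.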

Titles

train.data

NER Count
solution 18,924
research problem 15,646
method 8,854
resource 7,346
tool 1,718
language 1,141
dataset 882

dev.data

NER Count
solution 1,072
research problem 989
method 574
resource 439
tool 93
language 50
dataset 39

test.data

NER Count
solution 8,316
research problem 4,070
resource 3,226
method 2,768
tool 743
language 499
dataset 228

Abstracts

train-abs.data

NER Count
method 10,992
research problem 7,485

dev-abs.data

NER Count
method 719
research problem 603

test-abs.data

NER Count
method 2,723
research problem 2,100

The remaining repositories have specialized README files with the respective dataset statistics.

3) Citation

If this work inspires your further research, please consider citing our ICADL 2022 proceedings paper.

@InProceedings{10.1007/978-3-031-21756-2_3,
author="D'Souza, Jennifer
and Auer, S{\"o}ren",
editor="Tseng, Yuen-Hsien
and Katsurai, Marie
and Nguyen, Hoa N.",
title="Computer Science Named Entity Recognition in the Open Research Knowledge Graph",
booktitle="From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="35--45",
abstract="Domain-specific named entity recognition (NER) on Computer Science (CS) scholarly articles is an information extraction task that is arguably more challenging for the various annotation aims that can hamper the task and has been less studied than NER in the general domain.  Given that significant progress has been made on NER, we anticipate that scholarly domain-specific NER will receive increasing attention in the years to come. Currently, progress on CS NER -- the focus of this work -- is hampered in part by its recency and the lack of a standardized annotation aims for scientific entities/terms. Directly addressing these issues, this work proposes a standardized task by defining a set of seven contribution-centric scholarly entities for CS NER viz., research problem, solution, resource, language, tool, method, and dataset.",
isbn="978-3-031-21756-2"
}

Preprint

@article{d2022computer,
  title={Computer Science Named Entity Recognition in the Open Research Knowledge Graph},
  author={D'Souza, Jennifer and Auer, S{\"o}ren},
  journal={arXiv preprint arXiv:2203.14579},
  year={2022}
}

4) Additional resources

CS NER Software trained on the dataset in this repository

Codebase: https://gitlab.com/TIBHannover/orkg/nlp/orkg-nlp-experiments/-/tree/master/orkg_cs_ner

Service URL - REST API: https://orkg.org/nlp/api/docs#/annotation/annotates_paper_annotation_csner_post

Service URL - PyPI: https://orkg-nlp-pypi.readthedocs.io/en/latest/services/services.html#cs-ner-computer-science-named-entity-recognition