License: CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International)
PLOD: An Abbreviation Detection Dataset

This is the repository for the PLOD dataset, submitted to LREC 2022. The dataset can help build sequence-labelling models for the task of abbreviation detection. The paper is available on arXiv (2204.12061) and in the ACL Anthology (2022.lrec-1.71).

Dataset

We provide two variants of our dataset, Filtered and Unfiltered; both are described in our paper.

  1. The Filtered version can be accessed via Hugging Face Datasets, and a CoNLL-format copy is present in the data folder of this repository.

  2. The Unfiltered version can be accessed via Hugging Face Datasets, and a CoNLL-format copy is present in the data folder of this repository.

  3. The SDU Shared Task data we use for zero-shot testing is available here.
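For readers working from the CoNLL-format files, the sketch below shows one minimal way to read them: one token and its label per line, blank lines separating sentences. This is an assumption about the usual CoNLL layout, not a transcript of the repository's files; the exact column order and label names (e.g. B-AC for abbreviations, B-LF/I-LF for long forms) should be checked against the data folder.

```python
def read_conll(lines):
    """Yield (tokens, labels) pairs from CoNLL-style lines."""
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:                  # blank line ends a sentence
            if tokens:
                yield tokens, labels
                tokens, labels = [], []
        else:
            parts = line.split()
            tokens.append(parts[0])   # first column: the token
            labels.append(parts[-1])  # last column: the label
    if tokens:                        # flush a trailing sentence
        yield tokens, labels

# Hypothetical example in the dataset's label scheme:
sample = [
    "Light B-O",
    "chain B-O",
    "( B-O",
    "LC B-AC",
    ") B-O",
    "",
]
print(list(read_conll(sample)))
```

The parser yields one (tokens, labels) pair per sentence, so it can feed a sequence-labelling pipeline directly.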

Installation

We use the custom NER pipeline in the spaCy transformers library to train our models. This library supports training with any pre-trained language model available on the Hugging Face hub.
Please see the instructions on those websites to set up your own custom training with our dataset and reproduce the experiments using spaCy.

Alternatively, you can reproduce the experiments via the Python notebook we provide, which uses the Hugging Face Trainer class to perform the same experiments. The exact hyperparameters can be obtained from the README cards of the models linked below. Before starting, please perform the following steps:

git clone https://github.com/surrey-nlp/PLOD-AbbreviationDetection
cd PLOD-AbbreviationDetection
pip install -r requirements.txt

Now, you can use the notebook to reproduce the experiments.
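One step any Trainer-based reproduction must handle is re-aligning word-level labels to the sub-word tokens a transformer tokenizer produces. The sketch below shows the common convention (an assumption about how the notebook does it, not a copy of it): special tokens and continuation sub-words receive -100 so the loss ignores them. It operates on a plain `word_ids` list, mimicking the output of a Hugging Face tokenizer's `word_ids()` method, so no model download is needed.

```python
def align_labels(word_ids, word_labels):
    """Map word-level labels onto sub-word positions.

    word_ids: as returned by a fast tokenizer's word_ids() --
              None for special tokens, otherwise the index of
              the source word for each sub-token.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:           # [CLS]/[SEP]/padding positions
            aligned.append(-100)
        elif wid != previous:     # first sub-word: keep the word's label
            aligned.append(word_labels[wid])
        else:                     # continuation sub-word: ignore in loss
            aligned.append(-100)
        previous = wid
    return aligned

# Hypothetical example: "Light chain ( LC )" where "LC" splits in two;
# 1 = abbreviation label, 0 = non-abbreviation label.
word_ids = [None, 0, 1, 2, 3, 3, 4, None]
labels = [0, 0, 0, 1, 0]
print(align_labels(word_ids, labels))  # [-100, 0, 0, 0, 1, -100, 0, -100]
```

Using -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, which is why this convention is standard for token classification.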

Model(s)

Our best-performing models are hosted on the Hugging Face models repository:

| Model           | PLOD - Unfiltered                | PLOD - Filtered              | Description                                        |
|-----------------|----------------------------------|------------------------------|----------------------------------------------------|
| RoBERTa-large   | RoBERTa-large-finetuned-abbr     | -soon-                       | Fine-tuning on the RoBERTa-large language model    |
| RoBERTa-base    | -soon-                           | RoBERTa-base-finetuned-abbr  | Fine-tuning on the RoBERTa-base language model     |
| ALBERT-large-v2 | ALBERT-large-v2-finetuned-abbDet | -soon-                       | Fine-tuning on the ALBERT-large-v2 language model  |

Via the model links above, the model(s) can be used through the Inference API directly in the web browser. We have placed some examples with the API for testing.

Usage

The Hugging Face model links above include instructions for using these models locally in Python; the notebook provided in this Git repository demonstrates the same usage.
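Once a model has produced token-level predictions, a small post-processing step turns BIO-style tags into abbreviation and long-form spans. The sketch below is not from the paper; it is a generic BIO decoder under the assumption that the label scheme uses B-AC for abbreviations and B-LF/I-LF for long forms (with B-O marking non-entity tokens), which should be verified against the dataset's label set.

```python
def extract_spans(tokens, tags):
    """Collect (phrase, entity_type) spans from BIO-style tags."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") and tag != "B-O":
            if current:                       # close the previous span
                spans.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]  # open a new span
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)             # extend the open span
        else:
            if current:                       # non-entity tag closes a span
                spans.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:                               # flush a span at sentence end
        spans.append((" ".join(current), ctype))
    return spans

tokens = ["Light", "chain", "(", "LC", ")"]
tags = ["B-LF", "I-LF", "B-O", "B-AC", "B-O"]
print(extract_spans(tokens, tags))  # [('Light chain', 'LF'), ('LC', 'AC')]
```

Pairing each extracted AC span with the nearest LF span in the same sentence is one simple heuristic for linking abbreviations to their long forms.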

Citation

Zilio, L., Saadany, H., Sharma, P., Kanojia, D. and Orasan, C., 2022. PLOD: An Abbreviation Detection Dataset for Scientific Documents. arXiv preprint arXiv:2204.12061.

BibTeX Citation

Please use the following BibTeX entry when citing this work:

@InProceedings{zilio-EtAl:2022:LREC,
  author    = {Zilio, Leonardo  and  Saadany, Hadeel  and  Sharma, Prashant  and  Kanojia, Diptesh  and  Orăsan, Constantin},
  title     = {PLOD: An Abbreviation Detection Dataset for Scientific Documents},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {680--688},
  abstract  = {The detection and extraction of abbreviations from unstructured texts can help to improve the performance of Natural Language Processing tasks, such as machine translation and information retrieval. However, in terms of publicly available datasets, there is not enough data for training deep-neural-networks-based models to the point of generalising well over data. This paper presents PLOD, a large-scale dataset for abbreviation detection and extraction that contains 160k+ segments automatically annotated with abbreviations and their long forms. We performed manual validation over a set of instances and a complete automatic validation for this dataset. We then used it to generate several baseline models for detecting abbreviations and long forms. The best models achieved an F1-score of 0.92 for abbreviations and 0.89 for detecting their corresponding long forms. We release this dataset along with our code and all the models publicly at https://github.com/surrey-nlp/PLOD-AbbreviationDetection},
  url       = {https://aclanthology.org/2022.lrec-1.71}
}

Maintainer(s)

Diptesh Kanojia
Prashant Sharma