This is the repository for the PLOD dataset, submitted to LREC 2022. The dataset can help build sequence labelling models for the task of abbreviation detection. The paper is available here and, additionally, here.
We provide two variants of our dataset, Filtered and Unfiltered. Both are described in our paper here.
- The Filtered version can be accessed via HuggingFace Datasets here, and a CoNLL-format copy is available here in the data folder.
- The Unfiltered version can be accessed via HuggingFace Datasets here, and a CoNLL-format copy is available here in the data folder.
- The SDU Shared Task data we use for zero-shot testing is available here.
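For readers working with the CoNLL-format files directly, the following is a minimal sketch of a reader. It assumes whitespace-separated columns with the token in the first column and the label in the last, and blank lines between segments; the exact column layout and label names in the data folder may differ, and the sample labels below (`B-LF`, `I-LF`, `B-AC`, `O`) are illustrative only.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, label) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the label.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; the labels are illustrative, not copied from the data files.
sample = """\
Magnetic B-LF
resonance I-LF
imaging I-LF
( O
MRI B-AC
) O

It O
works O
""".splitlines()

sentences = read_conll(sample)
print(len(sentences))
print(sentences[0][4])
```

To read a file instead of the inline sample, pass an open file handle (for example `with open(path, encoding="utf-8") as f: read_conll(f)`), where `path` points to whichever file in the data folder you want to load.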
We use the custom NER pipeline in the spaCy transformers library to train our models. This library supports training with any pre-trained language model available at the 🚀 HuggingFace repository.
Please see the instructions at these websites to set up your own custom training with our dataset and reproduce the experiments using spaCy.
Alternatively, you can reproduce the experiments via the Python notebook we provide here, which uses the HuggingFace Trainer class to perform the same experiments. The exact hyperparameters can be obtained from the model cards linked below. Before starting, please perform the following steps:
git clone https://github.com/surrey-nlp/PLOD-AbbreviationDetection
cd PLOD-AbbreviationDetection
pip install -r requirements.txt
Now, you can use the notebook to reproduce the experiments.
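When fine-tuning a token classification model with the HuggingFace Trainer, word-level labels usually have to be aligned to subword tokens. The sketch below shows one common convention: the first subword of each word keeps the label, and special tokens and continuation subwords get `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. Whether the provided notebook uses exactly this scheme is an assumption; `align_labels` is a hypothetical helper, and `word_ids` stands for the list a fast tokenizer returns from `word_ids()`.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword tokens.

    word_ids: one entry per subword token; None marks a special token.
    word_labels: one label id per original word.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens (for example sentence boundaries) are ignored by the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Remaining subwords of the same word are ignored by the loss.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [0, 1, 2]
aligned_example = align_labels(example_word_ids, example_labels)
print(aligned_example)
```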
Our best-performing models are hosted on the HuggingFace models repository:
Models | PLOD - Unfiltered | PLOD - Filtered | Description |
---|---|---|---|
RoBERTa-large | RoBERTa-large-finetuned-abbr | -soon- | Fine-tuning on the RoBERTa-large language model |
RoBERTa-base | -soon- | RoBERTa-base-finetuned-abbr | Fine-tuning on the RoBERTa-base language model |
ALBERT-large-v2 | ALBERT-large-v2-finetuned-abbDet | -soon- | Fine-tuning on the ALBERT-large-v2 language model |
On the model pages linked above, each model can be tried directly in the web browser via the Inference API. We have placed some example inputs there for testing.
The same model pages also give instructions for using a model locally in Python, together with the notebook provided in this Git repository.
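Once a model has produced one label per token, the abbreviations and long forms can be collected into spans. The following is a minimal sketch under stated assumptions: it takes BIO-style labels with `AC` for abbreviations and `LF` for long forms, and treats both `O` and `B-O` as "outside". The label names and the helper `extract_spans` are illustrative, so check the model card for the labels a given model actually emits.

```python
from typing import List, Optional, Tuple


def extract_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Collect (span_type, text) pairs from BIO-style token labels."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the span collected so far, if any.
        if current_tokens and current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, label in zip(tokens, labels):
        prefix, _, kind = label.partition("-")
        if prefix == "B" and kind not in ("", "O"):
            # A new span starts; close any open one first.
            flush()
            current_tokens, current_type = [token], kind
        elif prefix == "I" and kind == current_type:
            # Continuation of the currently open span.
            current_tokens.append(token)
        else:
            # Outside label, or an I- tag with no matching open span.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Hypothetical model output for one segment.
tokens = ["Magnetic", "resonance", "imaging", "(", "MRI", ")"]
labels = ["B-LF", "I-LF", "I-LF", "O", "B-AC", "O"]
spans = extract_spans(tokens, labels)
print(spans)
```

Joining tokens with a single space is a simplification; if character offsets are available from the tokenizer, slicing the original text gives a more faithful span.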
Zilio, L., Saadany, H., Sharma, P., Kanojia, D. and Orasan, C., 2022. PLOD: An Abbreviation Detection Dataset for Scientific Documents. arXiv preprint arXiv:2204.12061.
Please use the following citation when citing this work:
@InProceedings{zilio-EtAl:2022:LREC,
author = {Zilio, Leonardo and Saadany, Hadeel and Sharma, Prashant and Kanojia, Diptesh and Orăsan, Constantin},
title = {PLOD: An Abbreviation Detection Dataset for Scientific Documents},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {680--688},
abstract = {The detection and extraction of abbreviations from unstructured texts can help to improve the performance of Natural Language Processing tasks, such as machine translation and information retrieval. However, in terms of publicly available datasets, there is not enough data for training deep-neural-networks-based models to the point of generalising well over data. This paper presents PLOD, a large-scale dataset for abbreviation detection and extraction that contains 160k+ segments automatically annotated with abbreviations and their long forms. We performed manual validation over a set of instances and a complete automatic validation for this dataset. We then used it to generate several baseline models for detecting abbreviations and long forms. The best models achieved an F1-score of 0.92 for abbreviations and 0.89 for detecting their corresponding long forms. We release this dataset along with our code and all the models publicly at https://github.com/surrey-nlp/PLOD-AbbreviationDetection},
url = {https://aclanthology.org/2022.lrec-1.71}
}