LSFO-expansion


Lifestyle factors in the biomedical literature: comprehensive resources for named entity recognition

We introduce novel resources for recognizing lifestyle factors (LSFs) in biomedical text. We present a dictionary-based and a transformer-based NER system, both of which show promising performance in identifying LSFs. We also present a novel Lifestyle Factor Ontology (LSFO), featuring a diverse hierarchical classification of LSFs, and an annotated corpus for LSFs that enables the training and evaluation of the transformer-based NER system. Both NER systems were used to detect LSFs in more than 36 million PubMed abstracts and 4.5 million PMC Open Access articles, identifying over 300 million LSF instances in the biomedical literature.
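To make the idea of dictionary-based NER concrete, here is a minimal Python sketch of longest-match dictionary lookup. It is not the Tagger implementation used in this project; the dictionary entries, the identifiers, and the function name `tag_text` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-dictionary: surface form -> concept identifier.
# Names and identifiers are illustrative, not taken from LSFO.
LSF_DICTIONARY = {
    "physical activity": "LSF:0001",
    "smoking": "LSF:0002",
    "mediterranean diet": "LSF:0003",
    "diet": "LSF:0004",
}


def tag_text(text, dictionary):
    """Return (start, end, matched_text, concept_id) for each dictionary hit.

    Longer names come first in the pattern, so "mediterranean diet"
    is preferred over the shorter name "diet" at the same position.
    """
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        flags=re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        concept_id = dictionary[match.group(0).lower()]
        hits.append((match.start(), match.end(), match.group(0), concept_id))
    return hits


for start, end, surface, concept_id in tag_text(
    "Smoking and a Mediterranean diet affect health.", LSF_DICTIONARY
):
    print(start, end, surface, concept_id)
```

A real dictionary-based tagger also has to handle spelling variants, ambiguous names, and blocklists, none of which this sketch attempts.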

Associated Zenodo page for files

The associated Zenodo page, Zenodo - LSFO-expansion, contains the following files:

  • Tagger dictionary files

  • Lifestyle Factors text corpus

    • 200 annotated abstracts
    • LSF Annotation guidelines for the corpus
  • Trained models

    • Fine-tuned BioBERT model for dictionary expansion
    • Trained BERTopic model for dictionary expansion
    • Fine-tuned Transformer-NER model
  • Large-scale runs

    • Input documents for the large-scale runs: 36.1 million PubMed abstracts (as of August 2023) and 4.5 million articles from the PMC open access subset (as of April 2022)
    • Outputs of the large-scale runs (matched LSFs) from both Tagger and the Transformer-based NER system: over 300 million detected lifestyle factor instances

Submodules

LSFO

LSFO - This repository contains the Lifestyle Factor Ontology (LSFO), a multilevel hierarchical classification of lifestyle factors that begins with main lifestyle categories at the top level and extends to specific subcategories and low-level concepts.
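As a rough illustration of such a multilevel hierarchy, the Python sketch below stores child-to-parent links and walks from a low-level concept up to its main category. The concept labels are hypothetical and not copied from LSFO, and the sketch assumes each concept has a single parent, which may not hold for the actual ontology.

```python
# Hypothetical child -> parent links; labels are illustrative only.
# Assumption: every concept has at most one parent.
PARENT = {
    "Marathon running": "Running",
    "Running": "Physical activity",
    "Mediterranean diet": "Dietary pattern",
    "Dietary pattern": "Nutrition",
}


def ancestors(concept, parent_of):
    """Return the chain of ancestors from the direct parent up to the top level."""
    path = []
    seen = {concept}
    while concept in parent_of:
        concept = parent_of[concept]
        if concept in seen:
            # Guard against accidental cycles in the parent links.
            raise ValueError(f"cycle detected at {concept!r}")
        seen.add(concept)
        path.append(concept)
    return path


print(ancestors("Marathon running", PARENT))
```

In this toy data, "Physical activity" and "Nutrition" play the role of main categories because they have no parent entry.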

S1000_Transformer_NER

S1000_Transformer_NER - This repository is a fork of the S1000-transformer-ner project. It has been lightly adapted for training a Named Entity Recognition (NER) system that detects lifestyle factors.

Installation and Setup

To clone this repository along with its submodules, use the following command:

git clone --recurse-submodules https://github.com/EsmaeilNourani/LSFO-expansion.git
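If the repository was cloned without --recurse-submodules, the submodule directories exist but are empty. The following Python sketch checks for that case. The directory names are assumed from the submodule headings in this README, and the helper name `missing_submodules` is hypothetical.

```python
from pathlib import Path

# Directory names assumed from the submodule headings in this README.
EXPECTED_SUBMODULES = ("LSFO", "S1000_Transformer_NER")


def missing_submodules(repo_root, names=EXPECTED_SUBMODULES):
    """Return the submodule directories that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for name in names:
        path = root / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_submodules(".")
    if missing:
        print("Missing or empty submodules:", ", ".join(missing))
        # Standard git command to fetch submodules after a plain clone.
        print("Try: git submodule update --init --recursive")
    else:
        print("All expected submodules are populated.")
```

Run it from the root of the cloned repository.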

Environment setup:

This code has been tested with Python 3.9 in a conda environment with the packages from requirements.txt installed. Running setup.sh downloads the pretrained transformer model and installs the needed packages.

NER model training:

Quickstart

conda create -n lsf-env python=3.9
conda activate lsf-env
pip install -r requirements.txt
./setup.sh
cd S1000_Transformer_NER
./scripts/run-ner.sh

These commands create the environment, install the required packages, run training with the hyperparameters set in run-ner.sh, and save the trained model.

Tagging documents using the trained NER model:

Update run-bio-tagger.sh to point to the input files and to the model trained in the previous step, then run the script:

cd S1000_Transformer_Tagger
./scripts/run-bio-tagger.sh

Note: Some packages listed in requirements.txt (spacy, scispacy) and the test data in tagger format are not needed for model training. They are used by the accompanying repository S1000-transformer-tagger, which tags documents with the trained model and reproduces the results. The environment therefore has to be set up again for the Transformer-tagger.
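The script name run-bio-tagger.sh suggests that the tagger works with BIO labels (B- for the first token of an entity, I- for continuation tokens, O for everything else). Assuming that is the case, the Python sketch below shows how token-level BIO tags can be grouped into entity mentions. The label name LSF and the function `bio_to_spans` are assumptions for illustration and are not taken from the tagger's code.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_label:
            # Continuation of the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity: close and reset.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical example sentence and tags.
tokens = ["Regular", "physical", "activity", "and", "smoking", "cessation"]
tags = ["O", "B-LSF", "I-LSF", "O", "B-LSF", "O"]
print(bio_to_spans(tokens, tags))
```

This sketch drops stray I- tags that do not follow a matching B- tag. Other conventions treat them as the start of a new entity, so the choice should follow whatever the actual tagger output uses.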