
WeLT: Improving Biomedical Fine-tuned Pre-trained Language Models with Cost-sensitive Learning

Authors: Ghadeer Mobasher*, Olga Krebs, Wolfgang Müller, and Michael Gertz

For transparency, all 🤗 models resulting from our experimental work are publicly available on the Hugging Face Hub.

Biomedical pre-trained language models (BioPLMs) have been achieving state-of-the-art results on various biomedical text-mining tasks. However, prevailing fine-tuning approaches naively train BioPLMs on the target datasets without considering the class distributions. This is problematic, especially when dealing with imbalanced biomedical gold-standard datasets for named entity recognition (NER). Despite their strong overall performance, state-of-the-art fine-tuned NER models are biased towards the other (O) tag and misclassify biomedical entities. To fill this gap, we propose WeLT, a cost-sensitive BERT that handles the class imbalance for biomedical NER. We investigate the impact of WeLT against traditional fine-tuning approaches on mixed-domain and domain-specific BioPLMs, and we evaluate WeLT against other weighting schemes such as Inverse of Number of Samples (INS), Inverse of Square Root of Number of Samples (ISNS), and Effective Number of Samples (ENS). Our results show that WeLT outperforms these schemes on four different biomedical BERT models and BioELECTRA across eight gold-standard datasets.

Installation

Dependencies

  • Python (>=3.6)
  • PyTorch (>=1.2.0)

  1. Clone this GitHub repository: git clone https://github.com/mobashgr/WELT.git
  2. Navigate to the WELT folder and install all required dependencies: python3 -m pip install -r requirements.txt
     Note: To install the appropriate torch build, follow the download instructions for your development environment.

Data Preparation

NER Datasets

Dataset sources:
  • NCBI-disease
  • BC5CDR-disease
  • BC5CDR-chem
  • BC4CHEMD
  • BC2GM
  • Linnaeus
The NER datasets above are directly retrieved from BioBERT via this link.
  • BioRED-Dis
  • BioRED-Chem
We have extended the aforementioned NER datasets to include BioRED. To convert from BioC XML/JSON to CoNLL, we used bconv and filtered the chemical and disease entities.
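As a rough illustration, a conversion along these lines can be done with bconv's load/dump API. The format identifiers, file paths, and the way the entity filtering is handled below are assumptions for the sketch; our conversion code is the authoritative version.

import bconv

# Load a BioRED document collection from BioC XML
# (use fmt='bioc_json' for the JSON release); format names are assumptions.
collection = bconv.load('BioRED/Train.BioC.XML', fmt='bioc_xml')

# In our pipeline, annotations are filtered to chemical (or disease) entities
# before export; the exact entity-type attributes depend on the bconv version,
# so that step is only indicated here.

# Write the collection out in CoNLL format for NER fine-tuning.
bconv.dump(collection, 'BioRED-Chem/train.conll', fmt='conll')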

Data Download
To download the NER datasets directly, run download.sh, or manually download them via this link into the WELT directory, unzip datasets.zip, and remove the archive with rm -r datasets.zip.

Data Pre-processing
We adapted preprocessing.sh from BioBERT to include BioRED.

Fine-tuning with class-imbalance handling

We have conducted experiments on different BERT models using the WeLT weighting scheme, and we compared WeLT against other existing weighting schemes as well as the corresponding traditional fine-tuning approaches (i.e., standard BioBERT fine-tuning).

Fine-tuning BERT Models

Model and used version in HF 🤗 (see the loading sketch below):
  • BioBERT: model_name_or_path
  • BlueBERT: model_name_or_path
  • PubMedBERT: model_name_or_path
  • SciBERT: model_name_or_path
  • BioELECTRA: model_name_or_path
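
Any of these checkpoints can be loaded for token classification with the standard 🤗 Transformers API. The sketch below uses the SciBERT identifier that also appears in the usage example further down; the label count is a placeholder and must match your labels.txt.

from transformers import AutoTokenizer, AutoModelForTokenClassification

# SciBERT checkpoint used in the usage example below; swap in any model from the list above.
model_name = "allenai/scibert_scivocab_uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels must equal the number of NER tags in labels.txt (placeholder value here).
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)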

Weighting Schemes

Supported weighting schemes (see the weight-computation sketch below):
  • Inverse of Number of Samples (INS)
  • Inverse of Square Root of Number of Samples (ISNS)
  • Effective Number of Samples (ENS)
  • Weighted Loss Trainer (WeLT) (ours)
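
As a rough sketch of how such class weights can be derived from per-tag token counts: the INS, ISNS, and ENS formulas are the standard ones, while the WeLT line is only our complement-of-frequency reading and should be treated as an assumption; run_weight_scheme.py is the authoritative definition.

import numpy as np

def class_weights(counts, scheme="WELT", beta=0.9):
    """Per-class weights from per-tag token counts (sketch, not the repo's exact code)."""
    counts = np.asarray(counts, dtype=float)
    if scheme == "INS":      # inverse of number of samples
        weights = 1.0 / counts
    elif scheme == "ISNS":   # inverse of square root of number of samples
        weights = 1.0 / np.sqrt(counts)
    elif scheme == "ENS":    # effective number of samples (Cui et al., 2019)
        weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    elif scheme == "WELT":   # assumption: complement of the relative class frequency
        weights = 1.0 - counts / counts.sum()
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # One common normalisation: make the weights average to 1 across classes.
    return weights / weights.sum() * len(counts)

# Example with heavily imbalanced O / B / I tag counts.
print(class_weights([90000, 800, 200], scheme="ENS", beta=0.3))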

Cost-Sensitive Fine-Tuning

We adapted BioBERT's run_ner.py to develop run_weight_scheme.py, which extends the Trainer class into WeightedLossTrainer and overrides the compute_loss function to apply INS, ISNS, ENS, and WeLT weights in a weighted cross-entropy loss function.
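
A minimal sketch of this pattern, assuming the class weights have already been computed by one of the schemes above (this is not the exact code in run_weight_scheme.py):

import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer that replaces the default loss with a class-weighted cross-entropy (sketch)."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # torch.FloatTensor of shape (num_labels,)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weight = self.class_weights.to(logits.device) if self.class_weights is not None else None
        # ignore_index=-100 (the default) skips padded/sub-word positions, as in HF token classification.
        loss_fct = torch.nn.CrossEntropyLoss(weight=weight)
        loss = loss_fct(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

The weighted trainer is then instantiated and used exactly like a standard Trainer, with class_weights derived from the tag counts of the training split.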

Evaluation
For a fair comparison, we used the same NER evaluation approach as BioBERT.
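
BioBERT reports entity-level precision, recall, and F1. For a quick, roughly equivalent check, seqeval can be used as a stand-in (this is only an illustration, not the repository's evaluation script):

from seqeval.metrics import classification_report

# Gold and predicted tag sequences, one list per sentence (toy example).
y_true = [["O", "B-Chemical", "I-Chemical", "O"], ["B-Disease", "O"]]
y_pred = [["O", "B-Chemical", "O", "O"], ["B-Disease", "O"]]

# Entity-level precision/recall/F1, in the spirit of the conlleval-style evaluation used by BioBERT.
print(classification_report(y_true, y_pred))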

Usage Example
This is an example of fine-tuning SciBERT on BioRED-Chem using the ENS weighting scheme with $\beta = 0.3$:

cd named-entity-recognition
./preprocess.sh

export SAVE_DIR=./output
export DATA_DIR=../datasets/NER

export MAX_LENGTH=384
export BATCH_SIZE=5
export NUM_EPOCHS=20
export SAVE_STEPS=1000
export ENTITY=BioRED-Chem
export SEED=1

python run_weight_scheme.py \
    --data_dir ${DATA_DIR}/${ENTITY}/ \
    --labels ${DATA_DIR}/${ENTITY}/labels.txt \
    --model_name_or_path allenai/scibert_scivocab_uncased \
    --output_dir ${ENTITY}-${MAX_LENGTH}-SciBERT-ENS-0.3 \
    --max_seq_length ${MAX_LENGTH} \
    --num_train_epochs ${NUM_EPOCHS} \
    --weight_scheme ENS \
    --beta_factor 0.3  \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --save_steps ${SAVE_STEPS} \
    --seed ${SEED} \
    --do_train \
    --do_eval \
    --do_predict \
    --overwrite_output_dir

Quick Links

  • Usage of WeLT
  • Hyperparameters

Citation

@inproceedings{mobasher-etal-2023-welt,
   title = "{W}e{LT}: Improving Biomedical Fine-tuned Pre-trained Language Models with Cost-sensitive Learning",
   author = {Mobasher, Ghadeer  and
     M{\"u}ller, Wolfgang  and
     Krebs, Olga  and
     Gertz, Michael},
   booktitle = "The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks",
   month = jul,
   year = "2023",
   address = "Toronto, Canada",
   publisher = "Association for Computational Linguistics",
   url = "https://aclanthology.org/2023.bionlp-1.40",
   pages = "427--438"
}

Acknowledgment

Ghadeer Mobasher* is part of the PoLiMeR-ITN and is supported by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement PoLiMeR, No 812616.