
npj Digital Medicine "An In-Depth Evaluation of Federated Learning on Biomedical Natural Language Processing for Information Extraction"

Primary language: Python. License: MIT.

FedNLP

Official Git repository for the paper "An In-Depth Evaluation of Federated Learning on Biomedical Natural Language Processing for Information Extraction"

First stable release

What we have 🌟

  • supports models with various architectures, including BERT, GPT, and BiLSTM-CRF
  • simulates federated learning using FedAvg
  • logs results with TensorBoard
  • auxiliary bash scripts to download models from Hugging Face and to run Python scripts in batch for fast prototyping
  • simulates distribution shift in federated learning
  • studies the impact of federated learning under different federation scales
  • studies the impact of different federated learning algorithms
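
The FedAvg aggregation step named in the list above can be sketched as follows. This is a minimal pure-Python illustration, not the repository's implementation: the function name `fedavg`, the dict-of-lists weight format, and the parameter names in the demo are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical weight format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def fedavg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    # Two illustrative sites with made-up parameters and sample counts.
    site_0 = {"classifier.weight": [1.0, 2.0], "classifier.bias": [0.0]}
    site_1 = {"classifier.weight": [3.0, 6.0], "classifier.bias": [1.0]}
    print(fedavg([site_0, site_1], [100, 300]))
```

Sites with more training samples pull the global model further toward their local update, which is why the sample counts are passed alongside the weights.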

What to expect in the next release 🚀

  • comparison with LLM

Overview

| task | models | datasets | FL |
|------|--------|----------|----|
| NER | BERT-base-uncased; BlueBERT; BioBERT; Bio_ClinicalBERT; GPT2; BiLSTM-CRF | 2018_n2c2; BC2GM; BC5CDR-disease; JNLPBA; NCBI-disease (download 🔗) | FedAvg |
| RE | BERT-base-uncased; BlueBERT; BioBERT; Bio_ClinicalBERT; GPT2 | 2018_n2c2; euadr; GAD (download 🔗) | FedAvg |

Installation

git clone https://github.com/PL97/FedNLP.git
cd FedNLP/

## set up the running environment
conda create -n fednlp python==3.9.12
conda activate fednlp
pip install -r requirements.txt

## download pretrained models from Hugging Face
chmod +x download_pretrained_models.sh
./download_pretrained_models.sh

Datasets

Download the datasets using the links in the table above. Rename the downloaded NER and RE folders to data, then place them under FedNLP/NER/ and FedNLP/RE/, respectively.
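
Before training, it can help to confirm that the data folders ended up where the instructions above put them. The following is a small sketch, not part of the repository: the helper name `missing_data_dirs` is hypothetical, and it only checks that FedNLP/NER/data and FedNLP/RE/data exist, not what they contain.

```python
from pathlib import Path
from typing import List


def missing_data_dirs(repo_root: str) -> List[str]:
    """Return the expected data directories that are absent under repo_root."""
    root = Path(repo_root)
    expected = [root / "NER" / "data", root / "RE" / "data"]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_dir()]


if __name__ == "__main__":
    # Run from inside the cloned FedNLP/ directory.
    missing = missing_data_dirs(".")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("data folders are in place")
```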

Usage

Named Entity Recognition (NER)

centralized training 👇

cd FedNLP/NER
chmod -R +x bash_scripts/

## run from python script
mkdir -p workspace_BC2GM/bluebert/baseline/
python main.py \
    --ds BC2GM \
    --split site-0 \
    --workspace workspace_BC2GM/bluebert/baseline/ \
    --model bluebert \
    --epochs 50


## alternatively, run from the bash script (recommended)
./bash_scripts/run.sh site-0 BC2GM bluebert 50  ## arg1: data split; arg2: dataset; arg3: model; arg4: total epochs
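
For batch prototyping, the positional arguments of run.sh shown above (data split, dataset, model, total epochs) can be generated programmatically. The sketch below only builds and prints the command lines. The helper name `build_run_commands` is hypothetical, and only the values `site-0`, `BC2GM`, and `bluebert` are taken from the example above; any other split, dataset, or model identifier you pass in is an assumption to check against the scripts.

```python
import shlex
from itertools import product
from typing import List, Sequence


def build_run_commands(
    splits: Sequence[str],
    datasets: Sequence[str],
    models: Sequence[str],
    epochs: int,
) -> List[List[str]]:
    """Build one run.sh invocation per (dataset, model, split) combination.

    Argument order follows run.sh: data split, dataset, model, total epochs.
    """
    return [
        ["./bash_scripts/run.sh", split, dataset, model, str(epochs)]
        for dataset, model, split in product(datasets, models, splits)
    ]


if __name__ == "__main__":
    for cmd in build_run_commands(["site-0"], ["BC2GM"], ["bluebert"], 50):
        # Print the command; to launch it instead, pass cmd to
        # subprocess.run(cmd, check=True) from inside FedNLP/NER.
        print(shlex.join(cmd))
```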

federated training 👇

cd FedNLP/NER

## run from bash script (recommended)
./bash_scripts/fed.sh fedavg BC2GM 10 bluebert 50  ## arg1: FL algorithm; arg2: dataset; arg3: total data splits; arg4: model; arg5: total epochs
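
To make the "total data splits" argument concrete, here is one way a dataset could be divided into sites named in the `site-0` style used above. This is an illustrative IID split under stated assumptions, not the procedure the repository uses to produce its splits: the function `split_into_sites`, the shuffle-and-deal scheme, and the seed are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def split_into_sites(examples: Sequence, num_sites: int, seed: int = 0) -> Dict[str, List]:
    """Shuffle examples and deal them round-robin into site-0 ... site-(num_sites - 1)."""
    if num_sites <= 0:
        raise ValueError("num_sites must be positive")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    return {f"site-{i}": shuffled[i::num_sites] for i in range(num_sites)}


if __name__ == "__main__":
    sites = split_into_sites(range(25), num_sites=10)
    for name, items in sites.items():
        print(name, len(items))
```

A round-robin deal after shuffling keeps site sizes within one example of each other. Simulating distribution shift would require a non-uniform assignment instead.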

Relation Extraction (RE)

centralized training 👇

cd FedNLP/RE

chmod -R +x bash_scripts/
./bash_scripts/run.sh site-0 euadr bluebert 50  ## arg1: data split; arg2: dataset; arg3: model; arg4: total epochs

federated training 👇

cd FedNLP/RE

./bash_scripts/fed.sh fedavg euadr 10 bluebert 50  ## arg1: FL algorithm; arg2: dataset; arg3: total data splits; arg4: model; arg5: total epochs

How to cite this work

@inproceedings{
    peng2023a,
    title={A Systematic Evaluation of Federated Learning on Biomedical Natural Language Processing},
    author={Le Peng and Sicheng Zhou and Jiandong Chen and Rui Zhang and Ziyue Xu and Ju Sun},
    booktitle={International Workshop on Federated Learning for Distributed Data Mining},
    year={2023},
    url={https://openreview.net/forum?id=pLEQFXACNA}
}