Bio-ELECTRA

BioMedical Language Processing with ELECTRA

Introduction

"ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset". -- Authors

For a detailed description and implementation, refer to the paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators and the official ELECTRA codebase, respectively.

Bio-ELECTRA is pre-trained on:

Requirements

Pre-training

Fine-tuning

Pre-training

Use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text.

Example:

python build_pretraining_dataset.py \
    --corpus-dir $corpus_path \
    --vocab-file $model_path/vocab.txt \
    --output-dir $model_path/pretrain_tfrecords \
    --max-seq-length 128 \
    --blanks-separate-docs False \
    --do-lower-case \
    --do-strip-accents \
    --num-processes 8
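To confirm that the shards were written as expected, a quick inspection along the following lines can help. This is a minimal sketch only: the feature names (input_ids, input_mask, segment_ids) and fixed sequence length follow the upstream ELECTRA data pipeline and are assumptions here, and the paths are placeholders.

import glob
import tensorflow as tf

# Placeholder path; point this at the --output-dir used above.
record_files = glob.glob("/path/to/model_path/pretrain_tfrecords/*")
dataset = tf.data.TFRecordDataset(record_files)

# Assumed feature spec, matching the upstream ELECTRA pre-training data format.
features = {
    "input_ids": tf.io.FixedLenFeature([128], tf.int64),
    "input_mask": tf.io.FixedLenFeature([128], tf.int64),
    "segment_ids": tf.io.FixedLenFeature([128], tf.int64),
}

for record in dataset.take(1):
    example = tf.io.parse_single_example(record, features)
    print(example["input_ids"].shape)  # should equal --max-seq-length (128)

print("total examples:", sum(1 for _ in dataset))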

Use run_pretraining.py to pre-train an ELECTRA model. Example:

python3 electra_repo/run_pretraining.py \
  --data-dir $model_path \
  --model-name $model_name \
  --hparams '{"model_size": "small", "num_train_steps": 1000000, "vocab_size": 169300}'

Refer to configure_pretraining.py to view/set the supported hyperparameters.
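For a quick look at the full set of defaults and how the --hparams overrides are applied, the pre-training configuration can also be instantiated directly. A minimal sketch, assuming it is run from inside the cloned ELECTRA codebase and that PretrainingConfig keeps the upstream (model_name, data_dir, **overrides) signature; the model name and path below are illustrative.

import configure_pretraining

# Same overrides as the --hparams JSON in the command above.
config = configure_pretraining.PretrainingConfig(
    model_name="bio-electra-small",   # illustrative model name
    data_dir="/path/to/model_path",   # placeholder path
    model_size="small",
    num_train_steps=1000000,
    vocab_size=169300,
)

for name, value in sorted(vars(config).items()):
    print(f"{name} = {value}")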

Fine-tuning with original (TensorFlow) checkpoints

Use run_finetuning.py to fine-tune and evaluate an ELECTRA model on a downstream NLP task. This implementation supports two main downstream tasks: Named-Entity Recognition (NER, token classification) and Relation Extraction (RE, sequence classification). The finetune directory contains the code to run each task.

NER datasets: BC2GM, BC5CDR-chem, JNLPBA, NCBI-disease, BC4CHEMD, BC5CDR-disease, linnaeus, s800

RE datasets: CHEMPROT, GAD, and DDI

Example 1: To fine-tune an RE task using the GAD dataset, run

DATA_DIR=<pretrained checkpoint dir>
python run_finetuning.py \
    --data-dir $DATA_DIR \
    --model-name electra-small \
    --hparams '{"model_size": "small", "task_names": ["gad"], "learning_rate": 3e-4, "max_seq_length": 128, "num_trials": 10, "num_train_epochs": 10.0, "train_batch_size": 32, "eval_batch_size": 32, "predict_batch_size": 32}'

Example 2: To fine-tune a NER task using the NCBI-disease dataset, run

DATA_DIR=<pretrained checkpoint dir>
python run_finetuning.py \
    --data-dir $DATA_DIR \
    --model-name electra-small \
    --hparams '{"model_size": "small", "task_names": ["ncbi-disease"], "learning_rate": 3e-4, "max_seq_length": 128, "num_trials": 10, "num_train_epochs": 10.0, "train_batch_size": 32, "eval_batch_size": 32, "predict_batch_size": 32}'

Fine-tuning with Hugging Face (HF) Transformers

To run fine-tuning with Hugging Face Transformers, users first need to convert the original TensorFlow checkpoint to PyTorch (note that you may need to clone the transformers repo). This can be done by running:

MODEL_DIR=<pretrained checkpoint dir>
ELECTRA_EXPORT="discriminator"

python transformers/src/transformers/models/electra/convert_electra_original_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path $MODEL_DIR \
    --config_file $MODEL_DIR/config.json \
    --pytorch_dump_path $MODEL_DIR/pytorch_model.bin \
    --discriminator_or_generator $ELECTRA_EXPORT

Sample config.json for electra-small:

{
  "model_type": "electra",
  "model_size": "small",
  "vocab_size": 169300,
  "embedding_size": 128,
  "hidden_size": 256,
  "num_hidden_layers": 12,
  "num_attention_heads": 4,
  "intermediate_size": 1024,
  "generator_size": "0.25",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "attention_probs_dropout_prob": 0.1,
  "max_position_embeddings": 512,
  "type_vocab_size": 2,
  "initializer_range": 0.02
}
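After the conversion, the discriminator can be loaded with the standard Hugging Face Electra classes as a quick sanity check. A minimal sketch, assuming MODEL_DIR contains config.json, pytorch_model.bin, and vocab.txt; the path and example sentence are placeholders.

from transformers import ElectraConfig, ElectraForPreTraining, ElectraTokenizerFast

model_dir = "/path/to/MODEL_DIR"  # placeholder for the pretrained checkpoint dir

config = ElectraConfig.from_pretrained(model_dir)
tokenizer = ElectraTokenizerFast.from_pretrained(model_dir)   # reads vocab.txt
model = ElectraForPreTraining.from_pretrained(model_dir, config=config)

inputs = tokenizer("Aspirin inhibits platelet aggregation.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # one replaced-token score per input token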

The downstream folder contains helper functions to download, load, and process the downstream datasets and to load custom metrics in the format expected by HF Transformers. The finetune.py script runs the fine-tuning task and can run on CPU, GPU, or TPU with minimal to no configuration.

Example 1: To fine-tune an RE task using the GAD dataset, run

python finetune.py --ckpt electra-small --loader re --dataset gad --output_dir finetune --greater_is_better --metric f1 --trials 3 --epochs 3

Example 2: To fine-tune a NER task using the NCBI-disease dataset, run

python finetune.py --ckpt electra-small --loader ner --dataset ncbi-disease --output_dir finetune --greater_is_better --metric f1 --trials 3 --epochs 3
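Once fine-tuning finishes, the resulting checkpoint can be used for inference with the standard HF pipeline API. A minimal sketch, assuming the best checkpoint is saved in Hugging Face format (config, weights, tokenizer) somewhere under the --output_dir; the directory name and example sentence below are placeholders.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

checkpoint_dir = "finetune/ncbi-disease"  # placeholder output location

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForTokenClassification.from_pretrained(checkpoint_dir)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge word pieces into entity spans
)
print(ner("Mutations in the BRCA1 gene are associated with breast cancer."))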

Citation

If you use this code in your publication, please cite: