"ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset". -- Authors
For a detailed description and implementation, refer to the paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators and to the official ELECTRA codebase, respectively.
Bio-ELECTRA is pretrained on:
- CORD-19 Abstracts
- Preprocessed PubMed texts
- BookCorpus
- Wikipedia
- Carefully curated PubMed abstracts covering the world's most prevalent chronic diseases.
Requirements:
- Python 3
- TensorFlow 1.15
- NumPy
- scikit-learn and SciPy
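As a quick sanity check of the environment, the minimal sketch below imports the packages listed above and verifies the TensorFlow version (the 1.15 assertion is an assumption based on the requirement list; adjust it if your fork targets a different 1.x release):
# Minimal environment check for the requirements listed above.
import numpy
import scipy
import sklearn
import tensorflow as tf

# The pre-training code targets the TF 1.x APIs, so TensorFlow 1.15 is expected here.
assert tf.__version__.startswith("1.15"), f"Expected TensorFlow 1.15, got {tf.__version__}"
print("TensorFlow", tf.__version__, "| NumPy", numpy.__version__,
      "| SciPy", scipy.__version__, "| scikit-learn", sklearn.__version__)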
Use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text.
Example:
python build_pretraining_dataset.py \
--corpus-dir $corpus_path \
--vocab-file $model_path/vocab.txt \
--output-dir $model_path/pretrain_tfrecords \
--max-seq-length 128 \
--blanks-separate-docs False \
--do-lower-case \
--do-strip-accents \
--num-processes 8
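For reference, the sketch below assembles a tiny toy corpus directory for the command above. The layout is an assumption carried over from the upstream ELECTRA codebase: plain-text files under --corpus-dir, with --blanks-separate-docs controlling whether blank lines mark document boundaries.
# Sketch: build a toy corpus directory for build_pretraining_dataset.py.
# Assumption (upstream ELECTRA convention): every file under --corpus-dir is plain
# text; with --blanks-separate-docs False each file is treated as one document,
# otherwise blank lines inside a file separate documents.
import os

corpus_path = "corpus"  # corresponds to $corpus_path in the command above
os.makedirs(corpus_path, exist_ok=True)

docs = [
    "Aspirin is widely used in the secondary prevention of cardiovascular disease.",
    "Type 2 diabetes mellitus is characterised by insulin resistance and relative insulin deficiency.",
]
for i, doc in enumerate(docs):
    with open(os.path.join(corpus_path, f"doc_{i}.txt"), "w") as f:
        f.write(doc + "\n")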
Use run_pretraining.py to pre-train an ELECTRA model.
Example:
python3 electra_repo/run_pretraining.py \
--data-dir $model_path \
--model-name $model_name \
--hparams '{"model_size": "small", "num_train_steps": 1000000, "vocab_size": 169300}'
Refer to configure_pretraining.py to view or set the supported hyperparameters.
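The overrides passed to --hparams are plain JSON, so they can also be generated programmatically, as in the sketch below. Writing them to a .json file and passing the file path to --hparams follows the upstream ELECTRA codebase; treat that option as an assumption for this fork.
# Sketch: build the JSON overrides passed to run_pretraining.py via --hparams.
# Every key must be a hyperparameter defined in configure_pretraining.py.
import json

hparams = {
    "model_size": "small",
    "num_train_steps": 1000000,
    "vocab_size": 169300,  # size of the biomedical vocab.txt
}

print(json.dumps(hparams))            # paste this string after --hparams, or
with open("hparams.json", "w") as f:  # pass the path to this file instead
    json.dump(hparams, f, indent=2)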
Use run_finetuning.py to fine-tune and evaluate an ELECTRA model on a downstream NLP task.
This implementation supports fine-tuning on two main downstream tasks: Named-Entity Recognition (NER, token classification) and Relation Extraction (RE, sequence classification).
The finetune directory contains the code to run each task.
NER datasets: BC2GM, BC5CDR-chem, JNLPBA, NCBI-disease, BC4CHEMD, BC5CDR-disease, linnaeus, s800
RE datasets: CHEMPROT, GAD, and DDI
Example (RE fine-tuning on GAD):
DATA_DIR=<pretrained checkpoint dir>
python run_finetuning.py \
--data-dir $DATA_DIR \
--model-name electra-small \
--hparams '{"model_size": "small", "task_names": ["gad"], "learning_rate": 3e-4, "max_seq_length": 128, "num_trials": 10, "num_train_epochs": 10.0, "train_batch_size": 32, "eval_batch_size": 32, "predict_batch_size": 32}'
Example (NER fine-tuning on NCBI-disease):
DATA_DIR=<pretrained checkpoint dir>
python run_finetuning.py \
--data-dir $DATA_DIR \
--model-name electra-small \
--hparams '{"model_size": "small", "task_names": ["ncbi-disease"], "learning_rate": 3e-4, "max_seq_length": 128, "num_trials": 10, "num_train_epochs": 10.0, "train_batch_size": 32, "eval_batch_size": 32, "predict_batch_size": 32}'
To run fine-tuning with Hugging Face Transformers, users first need to convert the original checkpoint from TensorFlow to PyTorch (note that you may need to clone the transformers repo). This can be done by running:
MODEL_DIR=<pretrained checkpoint dir>
ELECTRA_EXPORT="discriminator"
python transformers/src/transformers/models/electra/convert_electra_original_tf_checkpoint_to_pytorch.py \
--tf_checkpoint_path $MODEL_DIR \
--config_file $MODEL_DIR/config.json \
--pytorch_dump_path $MODEL_DIR/pytorch_model.bin \
--discriminator_or_generator $ELECTRA_EXPORT
A sample config.json for electra-small:
{
"model_type": "electra",
"model_size": "small",
"vocab_size": 169300,
"embedding_size": 128,
"hidden_size": 256,
"num_hidden_layers": 12,
"num_attention_heads": 4,
"intermediate_size": 1024,
"generator_size": "0.25",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"attention_probs_dropout_prob": 0.1,
"max_position_embeddings": 512,
"type_vocab_size": 2,
"initializer_range": 0.02
}
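Once converted, the checkpoint can be loaded with Hugging Face Transformers. The sketch below is an illustration rather than the repo's fine-tuning code: it assumes the converted pytorch_model.bin, the config.json above, and the pre-training vocab.txt all live in the same directory, and it attaches a freshly initialized sequence-classification head (suitable for an RE task such as GAD) that still needs fine-tuning.
# Sketch: load the converted discriminator with HF Transformers (requires PyTorch).
from transformers import ElectraConfig, ElectraForSequenceClassification, ElectraTokenizer

model_dir = "electra-small"  # directory holding config.json, pytorch_model.bin, vocab.txt
tokenizer = ElectraTokenizer(vocab_file=f"{model_dir}/vocab.txt", do_lower_case=True)
config = ElectraConfig.from_pretrained(model_dir, num_labels=2)  # binary labels for GAD-style RE
model = ElectraForSequenceClassification.from_pretrained(model_dir, config=config)

inputs = tokenizer("Mutations in BRCA1 are associated with breast cancer.", return_tensors="pt")
logits = model(**inputs).logits  # classification head is randomly initialized until fine-tuned
print(logits.shape)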
The downstream folder contains helper functions to download, load, and process the downstream datasets and to load custom metrics in the format expected by HF Transformers.
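As an illustration of the metric format HF Transformers expects, a compute_metrics callback for a sequence-classification task could look like the generic sketch below (micro-averaged scikit-learn scores; the actual metric code shipped in the downstream folder may differ).
# Sketch: a compute_metrics callback in the format expected by transformers.Trainer.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    labels = eval_pred.label_ids
    return {
        "precision": precision_score(labels, preds, average="micro"),
        "recall": recall_score(labels, preds, average="micro"),
        "f1": f1_score(labels, preds, average="micro"),
    }

# With transformers.Trainer, pass compute_metrics=compute_metrics so that a metric
# such as f1 (cf. --metric f1 in the commands below) is reported at evaluation time.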
The finetune.py script runs the fine-tuning task and can run on CPU, GPU, or TPU with minimal to no configuration. Examples:
python finetune.py --ckpt electra-small --loader re --dataset gad --output_dir finetune --greater_is_better --metric f1 --trials 3 --epochs 3
python finetune.py --ckpt electra-small --loader ner --dataset ncbi-disease --output_dir finetune --greater_is_better --metric f1 --trials 3 --epochs 3
If you use this code for your publication, please cite: