Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models

BitFit (Paper)

Abstract

We introduce BitFit, a sparse-finetuning method where only the bias-terms of the model (or a subset of them) are being modified. We show that with small-to-medium training data, applying BitFit on pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, the method is competitive with other sparse fine-tuning methods. Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning: they support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
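To give a sense of how small the trainable set is, the following back-of-the-envelope sketch counts the bias entries in the encoder of a BERT-base-sized model. The dimensions (12 layers, hidden size 768, intermediate size 3072) are the standard BERT-base configuration, and the list of bias vectors per layer follows the usual BERT layer layout. Neither is taken from this repository, and embedding, pooler, and classifier biases are left out.

```python
# Rough count of encoder bias entries for a BERT-base-sized model.
# Assumed (not from this repository): 12 layers, hidden size 768,
# intermediate (feed-forward) size 3072.
NUM_LAYERS = 12
HIDDEN = 768
INTERMEDIATE = 3072

# Bias vectors in one encoder layer and their lengths.
BIAS_SIZES_PER_LAYER = {
    "attention.self.query": HIDDEN,
    "attention.self.key": HIDDEN,
    "attention.self.value": HIDDEN,
    "attention.output.dense": HIDDEN,
    "attention.output.LayerNorm": HIDDEN,
    "intermediate.dense": INTERMEDIATE,
    "output.dense": HIDDEN,
    "output.LayerNorm": HIDDEN,
}


def count_encoder_biases(num_layers=NUM_LAYERS):
    """Return the total number of bias entries across all encoder layers."""
    return num_layers * sum(BIAS_SIZES_PER_LAYER.values())


if __name__ == "__main__":
    print(count_encoder_biases())  # 101376
```

Under these assumptions the encoder holds roughly 100k bias values, compared with about 110 million parameters in the full model.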

Environment

First, create an environment with all the dependencies:

$ conda env create -n bitfit_env -f environment.yml

Then activate it:

$ conda activate bitfit_env

GLUE Benchmark evaluation examples:

python run_glue.py \
       --output-path <output_path> \
       --task-name <task_name> \
       --model-name <model_name> \
       --fine-tune-type <fine_tune_type> \
       --bias-terms <bias_terms> \
       --gpu-device <gpu_device> \
       --learning-rate <learning_rate> \
       --epochs <epochs> \
       --batch-size <batch_size> \
       --optimizer <optimizer_name> \
       --save-evaluator \
       --predict-test \
       --verbose

For further information about the arguments, run:

python run_glue.py -h

Example of executing full fine-tuning:

python run_glue.py \
       --output-path <output_path> \
       --task-name rte \
       --model-name bert-base-cased \
       --fine-tune-type full_ft \
       --learning-rate 1e-5

Example of executing full BitFit (training all bias terms):

python run_glue.py \
       --output-path <output_path> \
       --task-name rte \
       --model-name bert-base-cased \
       --fine-tune-type bitfit \
       --learning-rate 1e-3
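
A minimal sketch of the selection rule behind this mode, assuming that parameters carry Hugging Face-style names ending in `.bias` or `.weight` and that the task head is named `classifier`. The helper `is_trainable_bitfit` is hypothetical and is not part of `run_glue.py`. Keeping the classification head trainable is also an assumption of this sketch.

```python
def is_trainable_bitfit(param_name, head_prefix="classifier"):
    """Return True if a parameter should be updated under full BitFit.

    Assumes names like 'bert.encoder.layer.0.attention.self.query.bias'.
    """
    # The task-specific head is always trained (assumption of this sketch).
    if param_name.startswith(head_prefix):
        return True
    # Everything else is trained only if it is a bias vector.
    return param_name.endswith(".bias")


if __name__ == "__main__":
    names = [
        "bert.encoder.layer.0.attention.self.query.weight",
        "bert.encoder.layer.0.attention.self.query.bias",
        "bert.encoder.layer.0.intermediate.dense.bias",
        "classifier.weight",
    ]
    for name in names:
        print(name, is_trainable_bitfit(name))
```

With a PyTorch model, the same predicate could be applied by setting `param.requires_grad = is_trainable_bitfit(name)` while iterating over `model.named_parameters()`.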

Example of executing partial BitFit (training a subset of the bias terms):

python run_glue.py \
       --output-path <output_path> \
       --task-name rte \
       --model-name bert-base-cased \
       --fine-tune-type bitfit \
       --bias-terms query intermediate \
       --learning-rate 1e-3

Example of executing "frozen" training (i.e. using the pre-trained transformer as a feature extractor):

python run_glue.py \
       --output-path <output_path> \
       --task-name rte \
       --model-name bert-base-cased \
       --fine-tune-type frozen \
       --learning-rate 1e-3

Example of training a set of parameters chosen uniformly at random (similar to the "rand_100k" row in Table 3 of the BitFit paper):

python run_glue.py \
       --output-path <output_path> \
       --task-name rte \
       --model-name bert-base-cased \
       --fine-tune-type rand_uniform \
       --learning-rate 1e-3
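
As an illustration of what "uniformly chosen" can mean, the sketch below draws `k` scalar entries without replacement from the concatenation of all parameter tensors, then maps each draw back to its tensor. This is an assumption about the sampling scheme and not a description of what `run_glue.py` does. The helper `sample_trainable_indices` and the toy sizes are hypothetical.

```python
import random


def sample_trainable_indices(param_sizes, k, seed=0):
    """Pick k scalar parameters uniformly at random, without replacement.

    param_sizes maps a parameter name to its number of scalar entries.
    Returns a dict mapping each name to a sorted list of flat indices
    inside that tensor.
    """
    names = list(param_sizes)
    total = sum(param_sizes.values())
    rng = random.Random(seed)
    # Global offsets into the (conceptually) concatenated parameter vector.
    chosen = sorted(rng.sample(range(total), k))

    selected = {name: [] for name in names}
    start = 0  # global offset where the current tensor begins
    pos = 0    # position in the sorted list of chosen offsets
    for name in names:
        end = start + param_sizes[name]
        while pos < len(chosen) and chosen[pos] < end:
            selected[name].append(chosen[pos] - start)
            pos += 1
        start = end
    return selected


if __name__ == "__main__":
    # Toy sizes for illustration only.
    sizes = {"layer.weight": 90, "layer.bias": 10}
    picked = sample_trainable_indices(sizes, k=20)
    print({name: len(idx) for name, idx in picked.items()})
```

During training, such indices could be turned into a 0/1 mask per tensor and multiplied into the gradient so that only the selected entries are updated.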