FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?

[JAIR'23] FlexiBERT tool for Transformer design space exploration.

FlexiBERT is a tool for generating and evaluating diverse Transformer architectures on a range of NLP tasks. This repository has been forked from huggingface/transformers and then expanded to incorporate more heterogeneous Transformer architectures. The proposed NAS technique, BOSHNAS, is available at jha-lab/boshnas.


Environment setup

Clone this repository and initialize sub-modules

git clone https://github.com/jha-lab/txf_design-space.git
cd ./txf_design-space/
git submodule init
git submodule update

Setup python environment

The Python environment setup is based on conda. The script below creates a new environment named txf_design-space:

source env_step.sh

To install using pip, use the following command:

pip install -r requirements.txt

To test the installation, you can run:

python check_install.py

All training scripts use bash and have been implemented using SLURM, which will have to be set up before running the experiments.

Replicating results

Specify the design space

The design space is specified using .yaml files. Examples are given in the dataset/ directory. For the experiments in the paper, design_space/design_space_test.yaml was used.
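
As a quick sanity check, the chosen design-space file can be inspected before the graph library is generated. The sketch below simply loads and prints the YAML; it assumes PyYAML is available in the environment and makes no assumption about the exact keys in the file.

# inspect_design_space.py -- minimal sketch; assumes PyYAML is installed
import pprint
import yaml

with open('design_space/design_space_test.yaml') as f:
    design_space = yaml.safe_load(f)

# Print the parsed design space to verify the allowed architectural choices
pprint.pprint(design_space)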

Generate the graph library

This can be done in multiple steps of the hierarchy. From the given design space (design_space/design_space_test.yaml), the graph library is created at dataset/dataset_test_bn.json, with neighbors decided using biased overlap, as follows:

cd embeddings/
python generate_library.py --design_space ../design_space/design_space_test.yaml --dataset_file ../dataset/dataset_test_bn.json --layers_per_stack 2
cd ../

Other flags can also be used to control the graph library generation (check using python embeddings/generate_library.py --help).
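Once generated, the graph library can be given a first look directly from the JSON file. The sketch below only assumes that dataset/dataset_test_bn.json is valid JSON; the exact schema (e.g., how neighbors are stored) is determined by generate_library.py.

# inspect_library.py -- minimal sketch for a first look at the generated graph library
import json

with open('dataset/dataset_test_bn.json') as f:
    library = json.load(f)

# Report the size and top-level structure without assuming a particular schema
if isinstance(library, dict):
    print(f'{len(library)} top-level entries: {list(library)[:5]} ...')
else:
    print(f'{len(library)} entries of type {type(library[0]).__name__}')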

Prepare pre-training and fine-tuning datasets

Run the following scripts:

cd flexibert/
python prepare_pretrain_dataset.py
python save_roberta_tokenizer.py
python load_all_glue_datasets.py
python tokenize_glue_datasets.py
cd ../
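
The scripts above presumably build on the Hugging Face datasets library for the GLUE benchmark. As an illustration of what load_all_glue_datasets.py broadly does, a single task can be fetched as follows (a hedged sketch, not the repository's exact code):

# glue_example.py -- sketch of loading one GLUE task with the Hugging Face datasets library
from datasets import load_dataset

# Download (and cache) the SST-2 task of GLUE
sst2 = load_dataset('glue', 'sst2')
print(sst2['train'][0])  # e.g., {'sentence': ..., 'label': ..., 'idx': ...}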

Run BOSHNAS

For the selected graph library, run BOSHNAS with the following command:

cd flexibert/
python run_boshnas.py
cd ../

Other flags can be used to control the training procedure (check using python flexibert/run_boshnas.py --help). This script uses the SLURM scheduler over multiple compute nodes in a cluster (each node is assumed to have 2 GPUs; this can be changed in the code). SLURM can also be used in scenarios where distributed nodes are not available.
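
Before launching, it may help to confirm how many GPUs each node actually exposes, since the scripts assume two per node by default. A minimal PyTorch check (not part of the repository) is:

# gpu_check.py -- quick sanity check of per-node GPU visibility (not part of the repository)
import torch

print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Visible GPUs:   {torch.cuda.device_count()}')  # scripts assume 2 per node by default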

Generate graph library for next level of hierarchy

To generate a graph library with layers_per_stack=1 from the best models in the first level, use the following command:

cd flexibert/
python hierarchical.py --old_dataset_file ../dataset/dataset_test_bn.json --new_dataset_file ../dataset/dataset_test_bn_2.json --old_layers_per_stack 2 --new_layers_per_stack 1 
cd ../

This saves a new graph library for the next level of the hierarchy. Heterogeneous feed-forward stacks can also be generated using the flag --heterogeneous_feed_forward.

For this new graph library, BOSHNAS can be run again to get the next set of best-performing models.

Pre-trained models

The pre-trained models are accessible here.

To use the downloaded FlexiBERT-Mini model:

flexibert_mini = FlexiBERTModel.from_pretrained('./models/flexibert_mini/')
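
A hedged usage sketch follows. It assumes that FlexiBERTModel exposes the standard transformers forward interface and that a RoBERTa tokenizer is used, as suggested by save_roberta_tokenizer.py; adjust the tokenizer path to wherever the tokenizer was saved.

# Usage sketch (assumptions: standard transformers forward interface, RoBERTa tokenizer)
import torch
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')  # or the tokenizer saved by save_roberta_tokenizer.py
inputs = tokenizer('FlexiBERT explores heterogeneous Transformer architectures.', return_tensors='pt')

with torch.no_grad():
    outputs = flexibert_mini(**inputs)

print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)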

To instantiate a model in the FlexiBERT design space, create a model dictionary (number of layers l, per-layer operation types o, hidden dimensions h, number of operation heads n, feed-forward dimensions f, and operation parameters p) and generate a model configuration from it:

model_dict = {'l': 4, 'o': ['sa', 'sa', 'l', 'l'], 'h': [256, 256, 128, 128], 'n': [2, 2, 4, 4],
      'f': [[512, 512, 512], [512, 512, 512], [1024], [1024]], 'p': ['sdp', 'sdp', 'dct', 'dct']}
flexibert_mini_config = FlexiBERTConfig()
flexibert_mini_config.from_model_dict(model_dict)
flexibert_mini = FlexiBERTModel(flexibert_mini_config)

You can also use the FlexiBERT 2.0 hetero model dictionary format (paper under review). To transfer weights to another model within the design space (both should be in standard or hetero format):

model_dict = {'l': 2, 'o': ['sa', 'sa'], 'h': [128, 128], 'n': [2, 2],
      'f': [[512], [512]], 'p': ['sdp', 'sdp']}
bert_tiny_config = FlexiBERTConfig()
bert_tiny_config.from_model_dict(model_dict)
bert_tiny = FlexiBERTModel(bert_tiny_config, transfer_mode='RP')

# Implement fine-grained knowledge transfer using random projections
bert_tiny.load_model_from_source(flexibert_mini)

We will be adding more pre-trained models so stay tuned!

Developer

Shikhar Tuli. For any questions, comments or suggestions, please reach me at stuli@princeton.edu.

Cite this work

Cite our work using the following BibTeX entry:

@article{tuli2022jair,
      title={{FlexiBERT}: Are Current Transformer Architectures too Homogeneous and Rigid?}, 
      author={Tuli, Shikhar and Dedhia, Bhishma and Tuli, Shreshth and Jha, Niraj K.},
      year={2022},
      eprint={2205.11656},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

BSD-3-Clause. Copyright (c) 2022, Shikhar Tuli and Jha Lab. All rights reserved.

See License file for more details.