FastFormers provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Understanding (NLU), including demo models that show a 233.87x speed-up (yes, 233x on CPU with the multi-head self-attentive Transformer architecture; this is not an LSTM or an RNN). The methods and analyses are described in detail in the paper FastFormers: Highly Efficient Transformer Models for Natural Language Understanding.
- (June 3, 2021) The public onnxruntime (v1.8.0) now supports all FastFormers models. Special thanks to @yufenglee and onnxruntime team.
- (Nov. 4, 2020) We are actively working with the Hugging Face and onnxruntime teams so that you can use these features out of the box in huggingface's transformers and onnxruntime. Please stay tuned.
- With this repository, you can replicate the results presented in the FastFormers paper.
- The demo models of FastFormers are implemented with the SuperGLUE benchmark. The data processing pipeline is based on Alex Wang's reference implementation for SustaiNLP, which is a fork of HuggingFace's transformers repository.
- This repository is built on top of several open source projects, including transformers from HuggingFace, onnxruntime, transformers from Alex Wang, FBGEMM, TinyBERT, and others.
- FastFormers currently only supports Linux operating systems.
- CPU requirements:
- CPUs equipped with at least one, or both, of the AVX2 and AVX512 instruction sets are required. To get the full speed improvements and accuracy, the AVX512 instruction set is required. We have tested our runtime on Intel CPUs.
- GPU requirements:
- To utilize 16-bit floating point speed-up, GPUs with Volta or later architectures are required.
- onnxruntime v1.8.0+ is required to run FastFormers models.
- This repository is a branch of transformers, so you need to uninstall pre-existing transformers in your python environment.
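The CPU requirement above can be checked before installing anything. The following is a minimal sketch, assuming an x86 Linux machine where `/proc/cpuinfo` lists instruction-set extensions on `flags` lines. The helper name `simd_support` is hypothetical, and treating the `avx512f` flag as a proxy for "AVX512" is an assumption, since the requirement does not say which AVX512 subsets are needed.

```python
from pathlib import Path


def simd_support(cpuinfo_text):
    """Report AVX2/AVX512 support from the text of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 Linux kernels list instruction-set extensions on "flags" lines.
        if key.strip() == "flags":
            flags.update(value.split())
    return {
        "avx2": "avx2" in flags,
        # Assumption: avx512f (the foundation subset) stands in for AVX512.
        "avx512": "avx512f" in flags,
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(simd_support(cpuinfo.read_text()))
```

If `avx2` is reported but `avx512` is not, the models should still run, but according to the requirement above the full speed and accuracy are not guaranteed.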
This repo is tested on Python 3.6 and 3.7, PyTorch 1.5.0+.
You need to uninstall any pre-existing transformers package, as this repository uses a customized version of it.
First install PyTorch 1.5.0+. Then execute the following bash commands, which install onnxruntime 1.8.0, remove any existing transformers package, and install this repository.
pip install onnxruntime==1.8.0 --user --upgrade --no-deps --force-reinstall
pip uninstall transformers -y
git clone https://github.com/microsoft/fastformers
cd fastformers
pip install .
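After installation, it can help to confirm that the installed versions meet the minimums stated above (PyTorch 1.5.0+, onnxruntime 1.8.0+). The sketch below is not part of the repository; `version_tuple` and `meets_minimum` are hypothetical helpers. It compares versions numerically, because a plain string comparison would rank "1.10.0" below "1.8.0".

```python
import re


def version_tuple(version):
    """Turn '1.5.0+cpu' into (1, 5, 0); anything after the numeric prefix is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare versions numerically after padding both to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Example use inside your environment (both packages expose __version__):
#   import torch, onnxruntime
#   meets_minimum(torch.__version__, "1.5.0")
#   meets_minimum(onnxruntime.__version__, "1.8.0")
```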
All the models used to benchmark Table 3 in the paper are publicly shared. You can use the commands below to reproduce the results. The Table 3 measurements were taken on an Azure F16s_v2 instance.
The installation step needs to be done before proceeding.
- Download SuperGLUE dataset and decompress.
- Download demo model files and decompress.
wget https://github.com/microsoft/fastformers/releases/download/v0.1-model/teacher-bert-base.tar.gz
wget https://github.com/microsoft/fastformers/releases/download/v0.2-model/student-4L-312.tar.gz
wget https://github.com/microsoft/fastformers/releases/download/v0.2-model/student-pruned-8h-600.tar.gz
wget https://github.com/microsoft/fastformers/releases/download/v0.2-model/student-pruned-9h-900.tar.gz
- Run the teacher model (BERT-base) baseline
python3 examples/fastformers/run_superglue.py \
--model_type bert --model_name_or_path ${teacher_model} \
--task_name BoolQ --output_dir ${out_dir} --do_eval \
--data_dir ${data_dir} --per_instance_eval_batch_size 1 \
--use_fixed_seq_length --do_lower_case --max_seq_length 512 \
--no_cuda
- Run the teacher model (BERT-base) with dynamic sequence length
python3 examples/fastformers/run_superglue.py \
--model_type bert --model_name_or_path ${teacher_model} \
--task_name BoolQ --output_dir ${out_dir} --do_eval \
--data_dir ${data_dir} --per_instance_eval_batch_size 1 \
--do_lower_case --max_seq_length 512 --no_cuda
- Run the distilled student model (PyTorch)
python3 examples/fastformers/run_superglue.py \
--model_type bert --model_name_or_path ${student_model} \
--task_name BoolQ --output_dir ${out_dir} --do_eval \
--data_dir ${data_dir} --per_instance_eval_batch_size 1 \
--do_lower_case --max_seq_length 512 --no_cuda
- Run the distilled student with 8-bit quantization (onnxruntime)
python3 examples/fastformers/run_superglue.py \
--model_type bert --model_name_or_path ${student_model} \
--task_name BoolQ --output_dir ${out_dir} --do_eval \
--data_dir ${data_dir} --per_instance_eval_batch_size 1 \
--do_lower_case --max_seq_length 512 --use_onnxrt --no_cuda
- Run the distilled student with 8-bit quantization + multi-instance inference (onnxruntime)
OMP_NUM_THREADS=1 python3 examples/fastformers/run_superglue.py \
--model_type bert \
--model_name_or_path ${student_model} \
--task_name BoolQ --output_dir ${out_dir} --do_eval \
--data_dir ${data_dir} --per_instance_eval_batch_size 1 \
--do_lower_case --max_seq_length 512 --use_onnxrt \
--threads_per_instance 1 --no_cuda
- Run the distilled + pruned student with 8-bit quantization + multi-instance inference (onnxruntime)
OMP_NUM_THREADS=1 python3 examples/fastformers/run_superglue.py \
--model_type bert \
--model_name_or_path ${pruned_student_model} \
--task_name BoolQ --output_dir ${out_dir} --do_eval \
--data_dir ${data_dir} --per_instance_eval_batch_size 1 \
--do_lower_case --max_seq_length 512 --use_onnxrt \
--threads_per_instance 1 --no_cuda
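The first two demo runs above differ only in `--use_fixed_seq_length`. The sketch below illustrates the general idea of fixed versus dynamic sequence length, assuming that the fixed mode pads every input to `--max_seq_length` and the dynamic mode pads only to the longest input in the batch. `pad_batch` is a hypothetical helper, not the repository's code, and the token ids are made up.

```python
def pad_batch(batch, max_seq_length, fixed, pad_id=0):
    """Pad token-id lists to max_seq_length (fixed) or to the longest input (dynamic)."""
    truncated = [ids[:max_seq_length] for ids in batch]
    target = max_seq_length if fixed else max(len(ids) for ids in truncated)
    return [ids + [pad_id] * (target - len(ids)) for ids in truncated]


# Two short, made-up inputs.
batch = [[101, 2023, 102], [101, 2003, 2023, 2995, 102]]
fixed = pad_batch(batch, max_seq_length=512, fixed=True)
dynamic = pad_batch(batch, max_seq_length=512, fixed=False)

# Padded length per input in each mode.
print(len(fixed[0]), len(dynamic[0]))
```

Because the cost of self-attention grows with the padded length, short inputs do far less work in the dynamic mode, which is one reason the second demo run is faster than the first on the same model.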
This command fine-tunes pretrained or general distilled models (task-agnostic distillation) on downstream tasks. Currently, BERT and RoBERTa models are supported.
Tip 1. This repository is based on transformers, so you can use huggingface's pre-trained models (e.g. set --model_name_or_path to distilroberta-base to use distilroberta-base).
Tip 2. Before fine-tuning models, you can change the activation functions to ReLU to get better inference speed. To do this, download the config file of your model and manually change the activation setting (hidden_act in the case of BERT and RoBERTa models) to relu. Then specify the config file with the --config_name parameter.
Tip 3. Depending on the task and the models used, you can add --do_lower_case if it gives better accuracy.
python3 examples/fastformers/run_superglue.py \
--data_dir ${data_dir} --task_name ${task} \
--output_dir ${out_dir} --model_type ${model_type} \
--model_name_or_path ${model} \
--use_gpuid ${gpuid} --seed ${seed} \
--do_train --max_seq_length ${seq_len_train} \
--do_eval --eval_and_save_steps ${eval_freq} --save_only_best \
--learning_rate 0.00001 \
--warmup_ratio 0.06 --weight_decay 0.01 \
--per_gpu_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--logging_steps 100 --num_train_epochs 10 \
--overwrite_output_dir --per_instance_eval_batch_size 8
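Tip 2 above describes editing the model config by hand. The same change can be scripted; the following is a minimal sketch, assuming the config is a JSON file with a `hidden_act` key, as named in the tip for BERT and RoBERTa models. The helper names and file paths are hypothetical.

```python
import json
from pathlib import Path


def use_relu(config):
    """Return a copy of a model config dict with the activation set to ReLU."""
    updated = dict(config)
    updated["hidden_act"] = "relu"  # key used by BERT and RoBERTa configs
    return updated


def write_relu_config(src, dst):
    """Read a config JSON from src and write the ReLU variant to dst."""
    config = use_relu(json.loads(Path(src).read_text()))
    Path(dst).write_text(json.dumps(config, indent=2))


# Example (paths are placeholders):
#   write_relu_config("config.json", "config_relu.json")
# then add `--config_name config_relu.json` to the fine-tuning command.
```

Writing to a new file rather than overwriting the original keeps the downloaded config available for comparison.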
This command distills fine-tuned teacher models into smaller student models (task-specific distillation) on downstream tasks. As described in the paper, it is critical to initialize student models with general distilled models such as distilbert-base, distilroberta-base, and TinyBERT.
This command is also used to distill non-pruned models into pruned models.
This command always uses a task-specific logit loss between the teacher and student models for student training. You can add additional losses for hidden states (including token embedding) and attentions between teacher and student. To use hidden-state and attention distillation, the number of teacher layers must be a multiple of the number of student layers.
python3 examples/fastformers/run_superglue.py \
--data_dir ${data_dir} --task_name ${task} \
--output_dir ${out_dir} --teacher_model_type ${teacher_model_type} \
--teacher_model_name_or_path ${teacher_model} \
--model_type ${student_model_type} --model_name_or_path ${student_model} \
--use_gpuid ${gpuid} --seed ${seed} \
--do_train --max_seq_length ${seq_len_train} \
--do_eval --eval_and_save_steps ${eval_freq} --save_only_best \
--learning_rate 0.00001 \
--warmup_ratio 0.06 --weight_decay 0.01 \
--per_gpu_train_batch_size 4 \
--gradient_accumulation_steps 1 \
--logging_steps 100 --num_train_epochs 10 \
--overwrite_output_dir --per_instance_eval_batch_size 8 \
--state_loss_ratio 0.1
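The layer-count constraint above can be made concrete. The sketch below assumes a uniform layer mapping, in which each student layer imitates every k-th teacher layer; this is a common choice in layer-wise distillation, but whether this repository uses exactly this mapping is an assumption. `layer_mapping` is a hypothetical helper.

```python
def layer_mapping(num_teacher_layers, num_student_layers):
    """Map each student layer (1-based) to the teacher layer it would imitate."""
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError(
            "teacher layer count must be a multiple of the student layer count"
        )
    stride = num_teacher_layers // num_student_layers
    # Assumed uniform mapping: student layer s is paired with teacher layer s * stride.
    return {s: s * stride for s in range(1, num_student_layers + 1)}


# A 12-layer teacher and a 4-layer student pair up as 1->3, 2->6, 3->9, 4->12.
mapping = layer_mapping(12, 4)
```

With a 12-layer teacher, student depths of 1, 2, 3, 4, 6, or 12 satisfy the constraint, while a 5-layer student would not.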
This command performs structured pruning on the models described in the paper. It reduces the number of heads and the intermediate hidden states of FFN as set in the options. When the pruning is done on GPU, only 1 GPU is utilized (no multi-GPU).
To get better accuracy, you can do another round of knowledge distillation after the pruning.
python3 examples/fastformers/run_superglue.py \
--data_dir ${data_dir} --task_name ${task} \
--output_dir ${out_dir} --model_type ${model_type} \
--model_name_or_path ${model} --do_eval \
--do_prune --max_seq_length ${seq_len_train} \
--per_instance_eval_batch_size 1 \
--target_num_heads 8 --target_ffn_dim 600
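The selection step behind `--target_num_heads 8` can be illustrated as follows. This is a minimal sketch, assuming each attention head (or FFN hidden unit) already has an importance score; how the repository computes those scores is not shown here, and the scores below are made up. `keep_indices` is a hypothetical helper.

```python
def keep_indices(importance, target_count):
    """Return the sorted indices of the `target_count` highest-scoring units."""
    if not 0 < target_count <= len(importance):
        raise ValueError("target_count must be between 1 and the number of units")
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:target_count])


# Made-up importance scores for the 12 heads of one layer.
head_scores = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 0.05, 0.95, 0.15]
kept_heads = keep_indices(head_scores, target_count=8)
```

The same function applies to the FFN dimension: given one score per intermediate unit, `keep_indices(scores, 600)` would select the units kept under `--target_ffn_dim 600`.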
This command converts your PyTorch transformers models into an optimized ONNX format with 8-bit quantization. The converted ONNX model is saved in the directory where the original PyTorch model is located.
python3 examples/fastformers/run_superglue.py \
--task_name ${task} \
--model_type ${model_type} \
--model_name_or_path ${model} \
--convert_onnx
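To give an intuition for what 8-bit quantization does to model weights, the sketch below maps floats to integers in [0, 255] with a scale and zero point, then maps them back. It illustrates the general affine quantization idea only; it is not how onnxruntime implements quantization, and the function names are hypothetical.

```python
def quantize_uint8(values):
    """Affine-quantize floats to integers in [0, 255]; returns (quantized, scale, zero_point)."""
    # The representable range is widened to include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all values are zero
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 2.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by a small rounding error, which is the source of the accuracy change that quantized models can show.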
This command converts your PyTorch transformers models into a 16-bit floating point model (PyTorch). It creates a new directory named fp16 in the directory where the original model is located. The converted fp16 model and all necessary files are then saved to that directory.
python3 examples/fastformers/run_superglue.py \
--task_name ${task} \
--model_type ${model_type} \
--model_name_or_path ${model} \
--convert_fp16
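The effect of storing values in 16-bit floating point can be seen without a GPU. The sketch below uses Python's `struct` module, whose `e` format is IEEE 754 half precision, to round a float the way fp16 storage would. It only illustrates the number format and is not the repository's conversion code; `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# 1.0 is exactly representable in half precision; 0.1 is not.
exact = to_fp16(1.0)
approx = to_fp16(0.1)
error = abs(approx - 0.1)
```

Half precision keeps roughly three significant decimal digits, so small per-weight differences like `error` are expected after conversion.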
This command evaluates various models with the PyTorch or onnxruntime engine on the given tasks. For more detailed usage, please refer to the demo section.
OMP_NUM_THREADS=1 python3 examples/fastformers/run_superglue.py \
--model_type bert \
--model_name_or_path ${pruned_student_model} \
--task_name BoolQ --output_dir ${out_dir} --do_eval \
--data_dir ${data_dir} --per_instance_eval_batch_size 1 \
--do_lower_case --max_seq_length 512 --use_onnxrt \
--threads_per_instance 1 --no_cuda
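The command above pins each inference instance to one thread. The sketch below shows one way to reason about multi-instance inference: how many instances fit on a machine and how evaluation examples could be divided among them. How the repository actually schedules instances is not shown here, so both helpers are hypothetical and the round-robin split is an assumption.

```python
import os


def plan_instances(num_cores, threads_per_instance):
    """Number of inference instances that fit when each one gets its own threads."""
    if threads_per_instance < 1:
        raise ValueError("threads_per_instance must be at least 1")
    return max(1, num_cores // threads_per_instance)


def shard(examples, num_instances):
    """Split examples round-robin so each instance evaluates its own share."""
    return [examples[i::num_instances] for i in range(num_instances)]


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    instances = plan_instances(cores, threads_per_instance=1)
    print(f"{cores} cores -> {instances} single-threaded instances")
```

On a 16-vCPU machine such as the Azure F16s_v2 instance used for Table 3, `plan_instances(16, 1)` gives 16 single-threaded instances.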
This project has adopted the Microsoft Open Source Code of Conduct.
This project is licensed under the MIT License.