
Simplified-TinyBERT

This repository contains the code and resources for our paper:

Xuanang Chen, Ben He, Kai Hui, Le Sun, Yingfei Sun. Simplified TinyBERT: Knowledge Distillation for Document Retrieval. In ECIR 2021.

Introduction

Simplified TinyBERT is a knowledge distillation (KD) model built on BERT and designed for the document retrieval task. Experiments on two widely used benchmarks, MS MARCO and the TREC 2019 Deep Learning (DL) Track, demonstrate that Simplified TinyBERT not only improves over TinyBERT, but also significantly outperforms BERT-Base while providing a 15x speedup.

Requirements

We recommend using Anaconda to create the environment:

conda env create -f env.yaml

Then use pip to install the remaining required packages:

pip install -r requirements.txt

If you want to use mixed precision training (fp16) during distillation, you also need to install NVIDIA Apex in your environment.

Getting Started

In this repository, we provide instructions on how to run Simplified TinyBERT on MS MARCO and TREC 2019 DL document ranking tasks.

1. Data Preparation

  • For general distillation, detailed instructions are available in the TinyBERT repo. We also provide the raw text from English Wikipedia used in our experiments; see Resources.

  • For task-specific distillation, you can obtain the MS MARCO and TREC 2019 DL datasets following the guidelines for the TREC 2019 DL Track. The training triples are sampled from the corpus, documents are segmented into passages, and positive passages are filtered. After that, you should have training examples of the form index, query_text, passage_text, label_id; you can then use the tokenize_to_features.py script to tokenize the examples into the BERT input format (a rough sketch of this step follows this list). We release the processed training examples used in our experiments; see Resources.
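
As a rough illustration of what the tokenization step produces, the sketch below encodes one (query, passage) pair into standard BERT inputs with the HuggingFace transformers tokenizer. The [CLS] query [SEP] passage [SEP] encoding is standard for BERT rankers, but the maximum length, truncation strategy and field names here are assumptions, not the exact settings of tokenize_to_features.py.

# Minimal sketch of converting a training example into BERT input features.
# Assumes the usual [CLS] query [SEP] passage [SEP] encoding; max_len and
# truncation behaviour are illustrative, not the repo's exact settings.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def example_to_features(query_text, passage_text, max_len=512):
    enc = tokenizer(
        query_text,
        passage_text,
        truncation="only_second",   # truncate the passage, keep the query intact
        max_length=max_len,
        padding="max_length",
    )
    # input_ids, token_type_ids and attention_mask are what the ranker consumes
    return enc["input_ids"], enc["token_type_ids"], enc["attention_mask"]

ids, segments, mask = example_to_features("what is knowledge distillation",
                                          "Knowledge distillation transfers knowledge from a large model ...")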

2. Model Training

As the model used is a passage-level BERT ranker, the teacher model in our experiments is the BERT-Base fine-tuned on the MS MARCO passage dataset released in dl4marco-bert. The student model checkpoint is obtained from the general distillation stage.

Now, distill the model:

bash distill.sh

You can specify a KD method by setting --distill_model to 'standard' or 'simplified', which correspond to Standard KD and Simplified TinyBERT, respectively. For TinyBERT itself, please refer to the TinyBERT repo.
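
For intuition only, the sketch below shows a generic Hinton-style standard KD objective for a PyTorch BERT ranker: the student matches the teacher's softened score distribution while also fitting the hard relevance labels. The checkpoint paths, temperature, loss weighting and use of BertForSequenceClassification are assumptions for illustration; the actual objectives (including the simplified variant's additional distillation terms) are defined in the training code invoked by distill.sh and described in the paper.

# Generic sketch of a standard KD loss (soft labels + hard labels); not the
# repo's exact implementation. Paths and hyperparameters are placeholders.
import torch
import torch.nn.functional as F
from transformers import BertForSequenceClassification

teacher = BertForSequenceClassification.from_pretrained("path/to/teacher")  # fine-tuned BERT-Base ranker
student = BertForSequenceClassification.from_pretrained("path/to/student")  # generally distilled student
teacher.eval()

def standard_kd_loss(batch, labels, temperature=1.0, alpha=0.5):
    with torch.no_grad():
        t_logits = teacher(**batch).logits      # teacher scores, no gradient
    s_logits = student(**batch).logits
    # soft-label term: KL divergence between temperature-softened distributions
    soft = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # hard-label term: cross-entropy on the relevance labels
    hard = F.cross_entropy(s_logits, labels)
    return alpha * soft + (1 - alpha) * hard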

3. Re-ranking

After distillation, you can re-rank candidates using the distilled models. You need to segment candidate documents into passages to obtain example_id, query_text, passage_text pairs, where example_id should be query_id#passage_id, and then tokenize the pairs into features using the tokenize_to_features.py script. We release the pairs for the TREC 2019 DL test set; see Resources.
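
The sketch below illustrates one way to build such pairs: split each candidate document into overlapping word windows and emit example_id, query_text, passage_text triples. The window and stride sizes, and the way the passage_id is derived from the document id, are assumptions for illustration, not the paper's exact settings.

# Rough sketch of segmenting a candidate document into passages for re-ranking.
# Window/stride sizes and the passage_id scheme (doc_id plus an index) are
# illustrative assumptions.
def doc_to_pairs(query_id, query_text, doc_id, doc_text, window=150, stride=75):
    words = doc_text.split()
    pairs = []
    for i, start in enumerate(range(0, max(len(words) - window, 0) + 1, stride)):
        passage = " ".join(words[start:start + window])
        example_id = f"{query_id}#{doc_id}_{i}"   # example_id = query_id#passage_id
        pairs.append((example_id, query_text, passage))
    return pairs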

Now, run BERT inference:

bash reranker.sh

You then obtain relevance scores produced by the BERT ranker. The scores for query-passage pairs should be converted into a document ranking list in TREC format using the convert_to_trec_results.py script; the aggregation method is set to MaxP in our experiments. For evaluation, you can use trec_eval or the msmarco_mrr_eval.py script provided in this repo.
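
As a sketch of what MaxP aggregation does, the snippet below takes a document's score to be the maximum of its passage scores and writes the ranked list in the standard TREC run format (qid Q0 docid rank score run_name). The layout of the input score pairs and the passage_id scheme mirror the assumptions made above; convert_to_trec_results.py is the script actually used in the repo.

# Sketch of MaxP aggregation and TREC run output; input layout is an assumption.
from collections import defaultdict

def maxp_to_trec(scored_pairs, run_name="simplified-tinybert", out_path="run.trec"):
    # scored_pairs: iterable of (example_id, score), with example_id = "query_id#passage_id"
    doc_scores = defaultdict(dict)
    for example_id, score in scored_pairs:
        query_id, passage_id = example_id.split("#")
        doc_id = passage_id.rsplit("_", 1)[0]      # recover the document id (assumed scheme)
        doc_scores[query_id][doc_id] = max(score, doc_scores[query_id].get(doc_id, float("-inf")))
    with open(out_path, "w") as f:
        for query_id, docs in doc_scores.items():
            ranked = sorted(docs.items(), key=lambda x: x[1], reverse=True)
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                f.write(f"{query_id} Q0 {doc_id} {rank} {score} {run_name}\n")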

Resources

We release some useful resources for the reproducibility of our experiments and other research uses.

  • The teacher model and distilled student models:
Model                   (L / H)      Path
BERT-Base* (Teacher)    (12 / 768)   Download
Simplified TinyBERT     (6 / 768)    Download
Simplified TinyBERT     (3 / 384)    Download

* Note that the BERT-Base is the same one as in dl4marco-bert, but converted to a PyTorch model.

Citation

If you find our paper/code/resources useful, please cite:

@inproceedings{DBLP:conf/ecir/ChenHHSS21,
  author    = {Xuanang Chen and
               Ben He and
               Kai Hui and
               Le Sun and
               Yingfei Sun},
  title     = {Simplified TinyBERT: Knowledge Distillation for Document Retrieval},
  booktitle = {{ECIR} {(2)}},
  series    = {Lecture Notes in Computer Science},
  volume    = {12657},
  pages     = {241--248},
  publisher = {Springer},
  year      = {2021}
}

Acknowledgement

Some snippets of the code are borrowed from TinyBERT.