This repository contains the code and resources for our ECIR 2021 paper, *Simplified TinyBERT: Knowledge Distillation for Document Retrieval*.
Simplified TinyBERT is a knowledge distillation (KD) model built on BERT, designed for the document retrieval task. Experiments on two widely used benchmarks, MS MARCO and the TREC 2019 Deep Learning (DL) Track, demonstrate that Simplified TinyBERT not only improves over TinyBERT, but also significantly outperforms BERT-Base while providing a 15x speedup.
We recommend using Anaconda to create the environment:
conda env create -f env.yaml
Then use pip to install the other required packages:
pip install -r requirements.txt
If you want to use mixed-precision training (fp16) during distillation, you also need to install apex in your environment.
In this repository, we provide instructions on how to run Simplified TinyBERT on MS MARCO and TREC 2019 DL document ranking tasks.
- For general distillation, you can find detailed instructions in the TinyBERT repo. However, we provide the raw text from English Wikipedia used in our experiments; see Resources.
- For task-specific distillation, you can obtain the MS MARCO and TREC 2019 DL datasets following the guidelines of the TREC 2019 DL Track. The training triples are sampled from the corpus, documents are segmented into passages, and positive passages are filtered. After that, you should have training examples in the format `index, query_text, passage_text, label_id`; you can then use the `tokenize_to_features.py` script to tokenize the examples into the input format of BERT (see the sketch after this list). We release the processed training examples used in our experiments; see Resources.
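For illustration, here is a minimal sketch of how one training example could be turned into BERT input features. This is not the repo's `tokenize_to_features.py`; the tokenizer name, maximum sequence length, and truncation strategy below are assumptions.

```python
# Minimal sketch (not the repo's tokenize_to_features.py): turn one
# "index, query_text, passage_text, label_id" example into BERT inputs.
# The tokenizer checkpoint and max_length=256 are assumptions.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

example = {
    "index": 0,
    "query_text": "what is knowledge distillation",
    "passage_text": "Knowledge distillation transfers knowledge from a large teacher model ...",
    "label_id": 1,  # 1 = relevant, 0 = non-relevant
}

# Encode the (query, passage) pair as a single BERT input:
# [CLS] query tokens [SEP] passage tokens [SEP]
features = tokenizer(
    example["query_text"],
    example["passage_text"],
    max_length=256,
    truncation="only_second",  # truncate the passage, keep the query intact
    padding="max_length",
)

print(features["input_ids"][:20])
print(features["token_type_ids"][:20])
print(features["attention_mask"][:20])
```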
As the model used is a passage-level BERT ranker, the teacher model in our experiments is the BERT-Base fine-tuned on the MS MARCO passage dataset released in dl4marco-bert. The checkpoint of the student model is obtained from the general distillation stage.
Now, distill the model:
bash distill.sh
You can specify the KD method by setting `--distill_model` to `standard` or `simplified`, which correspond to Standard KD and Simplified TinyBERT, respectively. As for TinyBERT, please refer to the TinyBERT repo.
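What `distill.sh` actually optimizes is defined by the repo and the paper; purely as a rough illustration of the soft-label component of standard KD, here is a minimal sketch. The temperature, loss weighting, and binary-relevance setup are assumptions.

```python
# Sketch of a knowledge-distillation objective for a BERT ranker
# (illustrative only; loss weight alpha and temperature T are assumptions).
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
    """Combine soft-label distillation with the hard-label CE loss."""
    # Soft targets: KL divergence between teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the relevance labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Example usage with dummy tensors (batch of 2, binary relevance labels):
student_logits = torch.randn(2, 2, requires_grad=True)
teacher_logits = torch.randn(2, 2)
labels = torch.tensor([1, 0])
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```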
After distilling, you can re-rank candidates using the distilled models.
You need to segment the candidate documents into passages to get `example_id, query_text, passage_text` pairs, where `example_id` should be `query_id#passage_id`, and then tokenize the pairs into features using the `tokenize_to_features.py` script. We release the pairs for the TREC 2019 DL test set; see Resources.
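Below is a minimal sketch of how candidate documents might be split into passages and paired with queries to produce `example_id, query_text, passage_text` entries. The sliding-window size, stride, and passage-id naming are assumptions, not the exact settings of our preprocessing.

```python
# Sketch of passage segmentation for re-ranking (window/stride values are assumptions).
def segment_document(doc_text, window=150, stride=75):
    """Split a document into overlapping passages of roughly `window` words."""
    words = doc_text.split()
    passages = []
    for start in range(0, max(len(words) - window, 0) + 1, stride):
        passages.append(" ".join(words[start:start + window]))
    return passages

def make_rerank_pairs(query_id, query_text, doc_id, doc_text):
    """Yield (example_id, query_text, passage_text) with example_id = query_id#passage_id."""
    for i, passage in enumerate(segment_document(doc_text)):
        example_id = f"{query_id}#{doc_id}-{i}"  # passage_id naming here is an assumption
        yield example_id, query_text, passage
```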
Now, do BERT inference:
bash reranker.sh
Then, you can get the relevance scores produced by the BERT ranker. The scores for query-passage pairs should be converted to a document ranking list in TREC format using the `convert_to_trec_results.py` script; the aggregation method is set to MaxP (the document score is the maximum score among its passages) in our experiments. For evaluation metrics, you can use `trec_eval` or the `msmarco_mrr_eval.py` script provided in this repo.
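For reference, the following sketch shows the idea behind MaxP aggregation and the standard six-column TREC run format. It is not the repo's `convert_to_trec_results.py`; the run tag and the assumed `passage_id` layout are illustrative.

```python
# Sketch of MaxP aggregation: a document's score is the maximum score of its passages,
# written out in the standard 6-column TREC run format. The run tag is an assumption.
from collections import defaultdict

def maxp_to_trec(scored_pairs, run_tag="SimplifiedTinyBERT"):
    """scored_pairs: iterable of (example_id, score), example_id = 'query_id#passage_id'."""
    doc_scores = defaultdict(dict)
    for example_id, score in scored_pairs:
        query_id, passage_id = example_id.split("#")
        doc_id = passage_id.rsplit("-", 1)[0]  # assumes passage_id = '<doc_id>-<index>'
        # MaxP: keep the best passage score seen so far for this document.
        doc_scores[query_id][doc_id] = max(score, doc_scores[query_id].get(doc_id, float("-inf")))

    lines = []
    for query_id, docs in doc_scores.items():
        ranked = sorted(docs.items(), key=lambda x: x[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}")
    return lines
```

The resulting run file can then be scored against the TREC 2019 DL qrels with `trec_eval`, or with `msmarco_mrr_eval.py` for the MS MARCO MRR metric.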
We release some useful resources for the reproducibility of our experiments and other research uses.
- The teacher model and distilled student models:
| Model (L / H) | Path |
|---|---|
| BERT-Base* (Teacher) (12 / 768) | Download |
| Simplified TinyBERT (6 / 768) | Download |
| Simplified TinyBERT (3 / 384) | Download |
* Note that the BERT-Base teacher is the same model as in dl4marco-bert, but converted to a PyTorch checkpoint.
- Raw text from English Wikipedia for general distillation.
- Train examples for task-specific distillation.
- Sampled validation and test queries from MS MARCO Dev set.
- Test pairs of TREC 2019 DL for re-ranking.
- Run files of Simplified TinyBERT.
If you find our paper/code/resources useful, please cite:
@inproceedings{DBLP:conf/ecir/ChenHHSS21,
author = {Xuanang Chen and
Ben He and
Kai Hui and
Le Sun and
Yingfei Sun},
title = {Simplified TinyBERT: Knowledge Distillation for Document Retrieval},
booktitle = {{ECIR} {(2)}},
series = {Lecture Notes in Computer Science},
volume = {12657},
pages = {241--248},
publisher = {Springer},
year = {2021}
}
Some snippets of the code are borrowed from TinyBERT.