Passage Re-ranking with BERT

Introduction

NOTE: Most of the code in this repository was copied from the original BERT repository.

This repository contains the code to reproduce our entry to the MS MARCO passage ranking task, which placed first with a large margin over the second-place entry. It also contains the code to reproduce our result on the TREC-CAR dataset, which is ~22 MAP points higher than the best entry from 2017 and a well-tuned BM25.

MSMARCO Passage Re-Ranking Leaderboard (Jan 8th 2019) | Eval MRR@10 | Dev MRR@10
----------------------------------------------------- | ----------- | ----------
1st Place - BERT (this code)                           | 35.87       | 36.53
2nd Place - IRNet                                      | 28.06       | 27.80
3rd Place - Conv-KNRM                                  | 27.12       | 29.02

TREC-CAR Test Set (Automatic Annotations)             | MAP
----------------------------------------------------- | ----
BERT (this code)                                       | 33.5
BM25 Anserini                                          | 15.6
MacAvaney et al., 2017 (TREC-CAR 2017 Best Entry)      | 14.8

The paper describing our implementation is here.

Data

We make the following data available:

File | Description | Size | MD5
---- | ----------- | ---- | ---
BERT_Large_trained_on_MSMARCO.zip | BERT-large trained on MS MARCO | 3.4 GB | 2616f874cdabadafc55626035c8ff8e8
BERT_Base_trained_on_MSMARCO.zip | BERT-base trained on MS MARCO | 1.1 GB | 7a8c621e01c127b55dbe511812c34910
MSMARCO_tfrecord.tar.gz | MS MARCO TF Records | 9.1 GB | c15d80fe9a56a2fb54eb7d94e2cfa4ef
BERT_Large_dev_run.tsv | BERT-large run, dev set (~6980 queries x 1000 docs per query) | 121 MB | bcbbe19bcb2549dea3f26168c2bc445b
BERT_Large_test_run.tsv | BERT-large run, test set (~6836 queries x 1000 docs per query) | 119 MB | 9779903606e5b545f491132d8c2cf292
BERT_Large_trained_on_TREC_CAR.tar.gz | BERT-large trained on TREC-CAR | 3.4 GB | 8baedd876935093bfd2bdfa66f2279bc
BERT_Large_pretrained_on_TREC_CAR... | BERT-large pretrained on TREC-CAR's training set for 1M iterations | 3.4 GB | 9c6f2f8dbf9825899ee460ee52423b84
treccar_files.tar.gz | TREC-CAR queries, qrels, runs, and TF Records | 4.0 GB | 4e6b5580e0b2f2c709d76ac9c7e7f362
bert_predictions_test.run.tar.gz | TREC-CAR 2017 Automatic Run reranked by BERT-Large | 71 MB | d5c135c6cf5a6d25199bba29d43b58ba
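
After downloading, you can check that a file arrived intact by comparing its MD5 against the table above, e.g. with the md5sum utility available on most Linux systems:

md5sum BERT_Large_trained_on_MSMARCO.zip
# should print 2616f874cdabadafc55626035c8ff8e8 followed by the file name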

MS MARCO

Download and extract the data

First, we need to download and extract MS MARCO and BERT files:

DATA_DIR=./data
mkdir ${DATA_DIR}

wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.eval.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.small.tsv -P ${DATA_DIR}
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip -P ${DATA_DIR}

tar -xvf ${DATA_DIR}/triples.train.small.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.dev.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.eval.tar.gz -C ${DATA_DIR}
unzip ${DATA_DIR}/uncased_L-24_H-1024_A-16.zip -d ${DATA_DIR}

Convert MS MARCO to TFRecord format

Next, we need to convert the MS MARCO train, dev, and eval files to TFRecord files, which will later be consumed by BERT.

mkdir ${DATA_DIR}/tfrecord
python convert_msmarco_to_tfrecord.py \
  --output_folder=${DATA_DIR}/tfrecord \
  --vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
  --train_dataset_path=${DATA_DIR}/triples.train.small.tsv \
  --dev_dataset_path=${DATA_DIR}/top1000.dev.tsv \
  --eval_dataset_path=${DATA_DIR}/top1000.eval.tsv \
  --dev_qrels_path=${DATA_DIR}/qrels.dev.small.tsv \
  --max_query_length=64 \
  --max_seq_length=512 \
  --num_eval_docs=1000

This conversion takes 30-40 hours. Alternatively, you may download the TFRecord files here (~23GB).
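
If you download the pre-built TFRecords instead, extract them into the folder the training script expects. This is a sketch that assumes the archive was saved to ${DATA_DIR} and unpacks the *.tf files at its top level; adjust the paths if the layout differs:

mkdir -p ${DATA_DIR}/tfrecord
tar -xvf ${DATA_DIR}/MSMARCO_tfrecord.tar.gz -C ${DATA_DIR}/tfrecord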

Training

We can now start training. We highly recommend using the free TPUs in our Google Colab. Otherwise, note that a modern V100 GPU with 16GB of memory cannot fit even a batch size of 2 when training a BERT-Large model.

If you opt not to use the Colab, here is the command line to start training:

python run_msmarco.py \
  --data_dir=${DATA_DIR}/tfrecord \
  --bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --output_dir=${DATA_DIR}/output \
  --msmarco_output=True \
  --do_train=True \
  --do_eval=True \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --train_batch_size=128 \
  --eval_batch_size=128 \
  --learning_rate=3e-6

Training for 100k iterations takes approximately 30 hours on a TPU v3. Alternatively, you can download the trained model used in our submission here (~3.4GB).

You can also download a BERT Base model trained on MS MARCO here. This model leads to ~2 points lower MRR@10 (34.7), but it is faster to train and evaluate. It can also fit on a single 12GB GPU.
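
If you only want to reproduce the dev-set numbers with one of the downloaded checkpoints, you can skip training and run the same script in evaluation mode. The sketch below assumes the BERT_Large_trained_on_MSMARCO.zip archive was extracted to ${DATA_DIR}/BERT_Large_trained_on_MSMARCO; point --init_checkpoint at whatever checkpoint prefix the archive actually contains:

# Evaluation-only run; adjust --init_checkpoint to the extracted checkpoint's actual prefix.
python run_msmarco.py \
  --data_dir=${DATA_DIR}/tfrecord \
  --bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=${DATA_DIR}/BERT_Large_trained_on_MSMARCO/model.ckpt \
  --output_dir=${DATA_DIR}/output \
  --msmarco_output=True \
  --do_train=False \
  --do_eval=True \
  --eval_batch_size=128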

TREC-CAR

We describe in the next sections how to reproduce our results on the TREC-CAR dataset.

Downloading qrels, run and TFRecord files

The next steps (Indexing, Retrieval, and TFRecord conversion) take many hours. Alternatively, you can skip them and download the necessary files for training and evaluation here (~4.0GB), namely:

  • queries (*.topics);
  • query-relevant passage pairs (*.qrels);
  • query-candidate passage pairs (*.run);
  • TFRecord files (*.tf).

After downloading, you need to extract them to the TRECCAR_DIR folder:

TRECCAR_DIR=./treccar/
mkdir -p ${TRECCAR_DIR}
tar -xf treccar_files.tar.gz --directory ${TRECCAR_DIR}

And you are ready to go to the training/evaluation section.

Downloading and Extracting the data

If you decide to index, retrieve, and convert the data to the TFRecord format yourself, you first need to download and extract the TREC-CAR data:

TRECCAR_DIR=./treccar/
DATA_DIR=./data
mkdir ${DATA_DIR}

wget http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz -P ${TRECCAR_DIR}
wget http://trec-car.cs.unh.edu/datareleases/v2.0/train.v2.0.tar.xz -P ${TRECCAR_DIR}
wget http://trec-car.cs.unh.edu/datareleases/v2.0/benchmarkY1-test.v2.0.tar.xz -P ${TRECCAR_DIR}
wget https://storage.googleapis.com/bert_treccar_data/pretrained_models/BERT_Large_pretrained_on_TREC_CAR_training_set_1M_iterations.tar.gz -P ${DATA_DIR}


tar -xf ${TRECCAR_DIR}/paragraphCorpus.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xf ${TRECCAR_DIR}/train.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xf ${TRECCAR_DIR}/benchmarkY1-test.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xzf ${DATA_DIR}/BERT_Large_pretrained_on_TREC_CAR_training_set_1M_iterations.tar.gz -C ${DATA_DIR}

Indexing TREC-CAR

We need to index the corpus and retrieve documents using the BM25 algorithm for each query so we have query-document pairs for training.

We index the TREC-CAR corpus using Anserini, an excellent toolkit for information retrieval research.

First, we need to install Maven, and clone and compile Anserini's repository:

sudo apt-get install maven
git clone --recurse-submodules https://github.com/castorini/Anserini.git
cd Anserini
mvn clean package appassembler:assemble
tar xvfz tools/eval/trec_eval.9.0.4.tar.gz -C tools/eval/ && cd tools/eval/trec_eval.9.0.4 && make
cd ../ndeval && make

Now we can index the corpus (.cbor files):

sh Anserini/target/appassembler/bin/IndexCollection -collection CarCollection \
-generator DefaultLuceneDocumentGenerator -threads 40 -input ${TRECCAR_DIR}/paragraphCorpus -index \
${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -storePositions -storeDocvectors \
-storeRawDocs

You should see a message like this after it finishes:

2019-01-15 20:26:28,742 INFO  [main] index.IndexCollection (IndexCollection.java:578) - Total 29,794,689 documents indexed in 03:20:35

Retrieving query-candidate document pairs

We now retrieve candidate documents for each query using the BM25 algorithm. But first, we need to convert the TREC-CAR files to a format that Anserini can consume.

First, we merge qrels folds 0, 1, 2, and 3 into a single file for training. Fold 4 will be the dev set.

for f in ${TRECCAR_DIR}/train/fold-[0-3]-base.train.cbor-hierarchical.qrels; do (cat "${f}"; echo); done >${TRECCAR_DIR}/train.qrels
cp ${TRECCAR_DIR}/train/fold-4-base.train.cbor-hierarchical.qrels ${TRECCAR_DIR}/dev.qrels
cp ${TRECCAR_DIR}/benchmarkY1/benchmarkY1-test/test.pages.cbor-hierarchical.qrels ${TRECCAR_DIR}/test.qrels

We need to extract the queries (first column in the space-separated files):

cat ${TRECCAR_DIR}/train.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/train.topics
cat ${TRECCAR_DIR}/dev.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/dev.topics
cat ${TRECCAR_DIR}/test.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/test.topics

And remove all duplicated queries:

sort -u -o ${TRECCAR_DIR}/train.topics ${TRECCAR_DIR}/train.topics
sort -u -o ${TRECCAR_DIR}/dev.topics ${TRECCAR_DIR}/dev.topics
sort -u -o ${TRECCAR_DIR}/test.topics ${TRECCAR_DIR}/test.topics

We now retrieve the top-10 documents per query for training and development sets.

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/train.topics -output ${TRECCAR_DIR}/train.run -hits 10 -bm25 &

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/dev.topics -output ${TRECCAR_DIR}/dev.run -hits 10 -bm25 &

And we retrieve top-1,000 documents per query for the test set.

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/test.topics -output ${TRECCAR_DIR}/test.run -hits 1000 -bm25 &

After it finishes, you should see an output message like this:

(SearchCollection.java:166) - [Finished] Ranking with similarity: BM25(k1=0.9,b=0.4)
2019-01-16 23:40:56,538 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:167) - Run 2254 topics searched in 01:53:32
2019-01-16 23:40:56,922 INFO  [main] search.SearchCollection (SearchCollection.java:499) - Total run time: 01:53:36

This retrieval step takes 40-80 hours for the training set. We can speed it up by increasing the number of threads (e.g., -threads 6) and loading the index into memory (the -inmem option); see the example below.
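
For example, the training-set retrieval above can be rerun with both options enabled (timings will vary with hardware):

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/train.topics -output ${TRECCAR_DIR}/train.run -hits 10 -bm25 -threads 6 -inmem &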

Measuring BM25 Performance (optional)

To be sure that indexing and retrieval worked fine, we can measure the performance of this list of documents retrieved with BM25:

tools/eval/trec_eval.9.0.4/trec_eval -m map -m recip_rank -c ${TRECCAR_DIR}/test.qrels ${TRECCAR_DIR}/test.run

It is important to use the -c option as it assigns a score of zero to queries that had no passage returned. The output should be like this:

map                   	all	0.1528
recip_rank            	all	0.2294

Converting TREC-CAR to TFRecord

We can now convert the qrels (query-relevant document pairs), run (query-candidate document pairs), and corpus files into training, dev, and test TFRecord files that will be consumed by BERT. (You will need the CBOR package: pip install cbor.)

python convert_treccar_to_tfrecord.py \
  --output_folder=${TRECCAR_DIR}/tfrecord \
  --vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
  --corpus=${TRECCAR_DIR}/paragraphCorpus/dedup.articles-paragraphs.cbor \
  --qrels_train=${TRECCAR_DIR}/train.qrels \
  --qrels_dev=${TRECCAR_DIR}/dev.qrels \
  --qrels_test=${TRECCAR_DIR}/test.qrels \
  --run_train=${TRECCAR_DIR}/train.run \
  --run_dev=${TRECCAR_DIR}/dev.run \
  --run_test=${TRECCAR_DIR}/test.run \
  --max_query_length=64 \
  --max_seq_length=512 \
  --num_train_docs=10 \
  --num_dev_docs=10 \
  --num_test_docs=1000

This step requires at least 64GB of RAM, as we load the entire corpus into memory.

Training/Evaluating

Before starting training, you need to download a BERT Large model pretrained on the training set of TREC-CAR. This pretraining was necessary because the official pre-trained BERT models were pre-trained on the full Wikipedia, and therefore they have seen, although in an unsupervised way, Wikipedia documents that are used in the test set of TREC-CAR. Thus, to avoid this leak of test data into training, we pre-trained the BERT re-ranker only on the half of Wikipedia used by TREC-CAR’s training set.

As with MS MARCO training, we made this Google Colab available to train and evaluate on TREC-CAR.

If you opt not to use the Colab, here is the command line to start training:

python run_treccar.py \
  --data_dir=${TRECCAR_DIR}/tfrecord \
  --bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=${DATA_DIR}/pretrained_models_exp898_model.ckpt-1000000 \
  --output_dir=${TRECCAR_DIR}/output \
  --trec_output=True \
  --do_train=True \
  --do_eval=True \
  --num_train_steps=400000 \
  --num_warmup_steps=40000 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --learning_rate=1e-6 \
  --max_dev_examples=3000 \
  --num_dev_docs=10 \
  --max_test_examples=None \
  --num_test_docs=1000

Because trec_output is set to True, this script will produce a TREC-formatted run file "bert_predictions_test.run". We can evaluate the final performance of our BERT model using the official TREC eval tool, which is included in Anserini:

tools/eval/trec_eval.9.0.4/trec_eval -m map -m recip_rank -c ${TRECCAR_DIR}/test.qrels ${TRECCAR_DIR}/output/bert_predictions_test.run

And the output should be:

map                   	all	0.3356
recip_rank            	all	0.4787

We made our run file available here.
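
For reference, each line of a TREC run file uses the standard six-column format: query id, the literal string Q0, passage id, rank, score, and a run tag. A quick way to inspect the reranked output (the column values shown are placeholders, not actual ids):

head -1 ${TRECCAR_DIR}/output/bert_predictions_test.run
# <query_id> Q0 <passage_id> <rank> <score> <run_tag>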

Trained models

You can download our BERT Large trained on TREC-CAR here.

How do I cite this work?

@article{nogueira2019passage,
  title={Passage Re-ranking with BERT},
  author={Nogueira, Rodrigo and Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1901.04085},
  year={2019}
}