Source code of the CIKM 2021 long paper:
Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need,
including the following two parts:
- Pre-training on a corpus based on hyperlinks ✅
- Fine-tuning on the MS MARCO Document Ranking dataset 🌀
First, prepare a Python 3 environment and run the following commands:
git clone https://github.com/zhengyima/anchors.git anchors
cd anchors
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Besides, you should download a BERT model checkpoint in the Hugging Face Transformers format and save it to a directory BERT_MODEL_PATH. In our paper, we use the bert-base-uncased version. You can download it from the Hugging Face official model zoo or the Tsinghua mirror.
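For example, one way to fetch and save the checkpoint is through the transformers library itself (a minimal sketch; the directory below is just a placeholder for your own BERT_MODEL_PATH):
# Minimal sketch: download bert-base-uncased via the transformers library
# and save it to a local directory that will serve as BERT_MODEL_PATH.
from transformers import AutoModel, AutoTokenizer

BERT_MODEL_PATH = "/path/to/bert_model"  # placeholder; use your own path

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained(BERT_MODEL_PATH)
model.save_pretrained(BERT_MODEL_PATH)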
The corpus data should have one sample (in JSON format) per line, where each sample holds a sentence from a source page together with its anchor texts saved in an array, e.g.:
{
"sentence": one sentence s in source page,
"anchors": [
{
"text": anchor text of anchor a1,
"pos": [the start index of a1 in s, the end index of a1 in s],
"passage": the destination page of a1
},
{
"text": anchor text of anchor a2,
"pos": [the start index of a2 in s, the end index of a2 in s],
"passage": the destination page of a2
},
...
]
}
For your convenience, we provide a demo corpus file data/corpus/demo_data.txt. You can refer to the demo data to generate your own pre-training corpus, e.g., from a Wikipedia dump.
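As an illustration, the sketch below writes one corpus line in the format above; the sentence, anchor, and output file are made-up placeholders, and the exact end-index convention should be checked against demo_data.txt:
import json

# Sketch: build one pre-training corpus line in the format described above.
# All contents here are illustrative placeholders.
sentence = "Paris is the capital of France ."
anchor_text = "France"
start = sentence.index(anchor_text)
record = {
    "sentence": sentence,
    "anchors": [
        {
            "text": anchor_text,
            # end index taken as start + len(text); verify against demo_data.txt
            "pos": [start, start + len(anchor_text)],
            "passage": "France is a country in Western Europe ...",
        }
    ],
}
with open("my_corpus.txt", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")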
The process of generating the pre-training samples is complex, involving a long pipeline that covers four pre-training tasks. Thus, we provide a shell script shells/gendata.sh
that completes the whole process. If you are interested in the details, you can refer to the script. If you just want to run the code, run the following:
export CORPUS_DATA=./data/corpus/demo_data.txt
export DATA_PATH=./data/
export BERT_MODEL_PATH=/path/to/bert_model
bash shells/gendata.sh
After gendata.sh finishes successfully, you will find the pre-training data in DATA_PATH/merged/.
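As a quick sanity check (assuming the merged files are plain text with one sample per line, which you should verify against the actual output of gendata.sh), you can count the generated samples:
# Sketch: count lines in each file under DATA_PATH/merged/.
import glob
import os

merged_dir = "./data/merged/"  # i.e., DATA_PATH/merged/
for path in sorted(glob.glob(os.path.join(merged_dir, "*"))):
    if os.path.isfile(path):
        with open(path, encoding="utf-8") as f:
            print(path, sum(1 for _ in f))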
Then, set the output directory and start pre-training:
export PERTRAIN_OUTPUT_DIR=/path/to/output_path
bash shells/pretrain.sh
The process of fine-tuning is more complex than pre-training 💤
Thus, the author will pack and clean up the fine-tuning part when he is free, perhaps next weekend.
Notes: Since our model is pre-trained in the standard Hugging Face manner, you can apply the output checkpoints of pre-training to any down-stream method, just like using bert-base-uncased.
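For example, a minimal sketch of loading the pre-trained checkpoint with transformers (the path is whatever you set as PERTRAIN_OUTPUT_DIR; if the tokenizer files are not copied into that directory, load the tokenizer from BERT_MODEL_PATH instead):
# Sketch: load the pre-trained checkpoint just like bert-base-uncased.
from transformers import AutoModel, AutoTokenizer

PERTRAIN_OUTPUT_DIR = "/path/to/output_path"  # same path used in pretrain.sh

tokenizer = AutoTokenizer.from_pretrained(PERTRAIN_OUTPUT_DIR)
model = AutoModel.from_pretrained(PERTRAIN_OUTPUT_DIR)
# model can now be fine-tuned with any down-stream ranking method.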
If you use the code and datasets, please cite the following paper:
@inproceedings{DBLP:journals/corr/abs-2108-09346,
author = {Zhengyi Ma and
Zhicheng Dou and
Wei Xu and
Xinyu Zhang and
Hao Jiang and
Zhao Cao and
Ji{-}Rong Wen},
title = {Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need},
booktitle = {{CIKM} '21: The 30th {ACM} International Conference on Information
and Knowledge Management, Virtual Event, QLD, Australia, November 1-5, 2021},
publisher = {{ACM}},
year = {2021}
}