HyperLink-induced Pre-training for Passage Retrieval in Open-domain Question Answering

This is the official implementation of our ACL'2022 paper "Hyperlink-induced Pre-training for Passage Retrieval in OpenQA".

[Update-20230223] We added an evaluation on the widely used BEIR benchmark. See the "Zero-shot Performance on BEIR Benchmark" section.

Overview

In this paper, we propose HyperLink-induced Pre-training (HLP), a pre-training method that learns effective query-passage (Q-P) relevance induced by the hyperlink topology within naturally occurring Web documents. Specifically, the Q-P pairs are automatically extracted from online documents, with relevance designed via hyperlink-based topology to facilitate downstream retrieval for question answering.

Note: the hyperlink-induced Q-P pairs are mostly semantically close but lexically diverse, so they can be used as unsupervised paraphrases extracted from the internet. Some examples are shared in the "Examples of HLP Q-P pairs" section.

Setup

Installation

  1. Install from source. A Python virtual environment or a Conda environment is recommended.
git clone git@github.com:jzhoubu/HLP.git
cd HLP
conda create -n hlp python=3.7
conda activate hlp
pip install -r requirements.txt
  2. Change the HLP_HOME variable in biencoder_train_cfg.yaml, gen_embs.yaml and dense_retriever.yaml. HLP_HOME is the path to the HLP directory you cloned.

  3. You may also need to build apex.

git clone https://github.com/NVIDIA/apex
cd apex
python -m pip install -v --disable-pip-version-check --no-cache-dir ./

Prepare Data and Models

[Option1] Download Data via Command

bash downloader.sh

This command will automatically download the necessary data (about 50GB) for experiments.

[Option2] Download Data Manually

Please download the data to the pre-defined locations specified in conf/*/*.yaml.

Dataset Download Links

| Dataset | train | dev | test | corpus |
|---|---|---|---|---|
| HLP | dl_10m.jsonl, cm_10m.jsonl | / | / | / |
| NQ | nq-train.jsonl | nq-dev.jsonl | nq-test.qa.csv | psgs_w100.tsv |
| TriviaQA | trivia-train.jsonl | trivia-dev.jsonl | trivia-test.qa.csv | |
| WebQA | webq-train.jsonl | webq-dev.jsonl | webq-test.qa.csv | |
| MS MARCO | msmarco-train.jsonl | msmarco-dev.jsonl | / | |

Download Models

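Zero-shot Performance

| Models | Trainset | TrainConfig | Size | NQ Top5 | NQ Top20 | NQ Top100 | TriviaQA Top5 | TriviaQA Top20 | TriviaQA Top100 | WebQ Top5 | WebQ Top20 | WebQ Top100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BM25 | / | / | / | 43.6 | 62.9 | 78.1 | 66.4 | 76.4 | 83.2 | 42.6 | 62.8 | 76.8 |
| DL | dl_10m | pretrain_8xV100 | 418M | 49.0 | 67.8 | 79.7 | 62.0 | 73.8 | 82.1 | 48.4 | 67.1 | 79.5 |
| CM | cm_10m | pretrain_8xV100 | 418M | 42.5 | 62.2 | 77.9 | 63.2 | 75.8 | 83.7 | 45.4 | 64.5 | 78.9 |
| HLP | dl_10m, cm_10m | pretrain_8xV100 | 418M | 50.9 | 69.3 | 82.1 | 65.3 | 77.0 | 84.1 | 49.1 | 67.4 | 80.5 |

Finetune Performance

| Models | TuneSet | TuneConfig | Size | NQ Top5 | NQ Top20 | NQ Top100 | TriviaQA Top5 | TriviaQA Top20 | TriviaQA Top100 | WebQ Top5 | WebQ Top20 | WebQ Top100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HLP | nq-train | finetune_8xV100 | 840M | 70.6 | 81.3 | 88.0 | / | / | / | / | / | / |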

More information about these checkpoints can be found in the model-card.

Experiments

Retriever Training

Below is an example of pre-training HLP.

python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    hydra.run.dir=./experiments/pretrain_hlp/train \
    val_av_rank_start_epoch=0 \
    train_datasets=[dl,cm] dev_datasets=[nq_dev] \
    train=pretrain_8xV100
  • hydra.run.dir: working directory of hydra (logs and checkpoints will be saved here).
  • val_av_rank_start_epoch: epoch at which validation starts using average ranking.
  • train_datasets: alias of the train set name (see conf/datasets/train.yaml).
  • dev_datasets: alias of the dev set name (see conf/datasets/train.yaml).
  • train: name of a YAML training configuration file (under conf/train).
  • See more configuration settings in biencoder_train_cfg.yaml and pretrain_8xV100.yaml.

Below is an example of fine-tuning on the NQ dataset using a pre-trained checkpoint:

python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    hydra.run.dir=./experiments/finetune_nq/train \
    model_file=../../pretrain_hlp/train/dpr_biencoder.best \
    train_datasets=[nq_train] dev_datasets=[nq_dev] \
    train=finetune_8xV100
  • model_file: a relative path to the model checkpoint

Note: When fine-tuning on the NQ dataset, please also use train=finetune_8xV100 during the embedding phase and the retrieval phase.
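The relative model_file values in the example commands appear to be resolved against hydra.run.dir. This is an inference from the examples in this README, not a documented guarantee. Under that assumption, the sketch below shows where each relative path would point; resolve_checkpoint is a hypothetical helper and not part of this repository.

```python
import posixpath


def resolve_checkpoint(run_dir: str, model_file: str) -> str:
    """Resolve model_file against run_dir (assumed to be the working directory)."""
    return posixpath.normpath(posixpath.join(run_dir, model_file))


# Fine-tuning example: points back at the checkpoint saved by the pre-training run.
print(resolve_checkpoint("./experiments/finetune_nq/train",
                         "../../pretrain_hlp/train/dpr_biencoder.best"))

# Embedding example: resolves to the same checkpoint from the embed directory.
print(resolve_checkpoint("./experiments/pretrain_hlp/embed",
                         "../train/dpr_biencoder.best"))
```

Both calls print experiments/pretrain_hlp/train/dpr_biencoder.best, which is where the pre-training command writes its checkpoints if hydra.run.dir is the output location.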

Corpus Embedding

Generating representation vectors for the static documents dataset is a highly parallelizable process which can take up to a few days if computed on a single GPU. You might want to use multiple available GPU servers by running the script on each of them independently and specifying their own shards.

Below is an example of generating embeddings for the Wikipedia corpus.

python ./generate_dense_embeddings.py \
    hydra.run.dir=./experiments/pretrain_hlp/embed \
    train=pretrain_8xV100 \
    model_file=../train/dpr_biencoder.best \
    ctx_src=dpr_wiki \
    shard_id=0 num_shards=1 \
    out_file=embedding_dpr_wiki \
    batch_size=10000
  • model_file: a relative path to the model checkpoint.
  • ctx_src: alias of the passages resource (see conf/ctx_sources/corpus.yaml).
  • out_file: prefix name of the output embedding.
  • shard_id: 0-based index of the data shard to process.
  • num_shards: total number of data shards.
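To illustrate how shard_id and num_shards can divide the corpus across machines, here is a minimal sketch that assumes the passages are split into contiguous, near-equal chunks. The actual splitting logic in generate_dense_embeddings.py may differ, and shard_bounds is a hypothetical helper.

```python
import math
from typing import Tuple


def shard_bounds(num_passages: int, shard_id: int, num_shards: int) -> Tuple[int, int]:
    """Return the [start, end) passage range for one shard (contiguous split)."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    shard_size = math.ceil(num_passages / num_shards)
    start = min(shard_id * shard_size, num_passages)
    end = min(start + shard_size, num_passages)
    return start, end


# Toy corpus of 10 passages split over 4 shards (one shard per machine).
for shard_id in range(4):
    print(shard_id, shard_bounds(10, shard_id, 4))
```

Each machine would run the embedding command with its own shard_id and the same num_shards, so that the shards together cover the whole corpus exactly once.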

Retrieval Evaluation

Below is an example of evaluating a model on the NQ test set.

python dense_retriever.py \
	  hydra.run.dir=./experiments/pretrain_hlp/infer \
	  train=pretrain_8xV100 \
	  model_file=../train/dpr_biencoder.best \
	  qa_dataset=nq_test \
	  ctx_datatsets=[dpr_wiki] \
	  encoded_ctx_files=["../embed/embedding_dpr_wiki*"] \
	  out_file=nq_test.result
  • model_file: a relative path to the model checkpoint
  • qa_dataset: alias of the test set (see conf/datasets/eval.yaml)
  • encoded_ctx_files: list of glob expressions matching the corpus embedding files
  • out_file: path of the output file
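The Top5/Top20/Top100 numbers reported in the model tables are top-k retrieval accuracies. As a rough sketch of that metric, the code below counts a question as a hit when any of its top-k retrieved passages contains an answer string. The case-insensitive substring match is a simplifying assumption and may not match the answer matching used by dense_retriever.py; top_k_accuracy and the toy data are hypothetical.

```python
from typing import List


def top_k_accuracy(retrieved: List[List[str]], answers: List[List[str]], k: int) -> float:
    """Fraction of questions with at least one answer string in the top-k passages."""
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top_k = [p.lower() for p in passages[:k]]
        if any(g.lower() in p for g in golds for p in top_k):
            hits += 1
    return hits / len(retrieved) if retrieved else 0.0


# Toy example: ranked passages per question, plus the accepted answers.
retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["The Nile flows through Egypt.", "The Amazon is in South America."],
]
answers = [["Paris"], ["Amazon"]]

print(top_k_accuracy(retrieved, answers, k=1))  # 0.5
print(top_k_accuracy(retrieved, answers, k=2))  # 1.0
```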

Others

Data Formats for Training Retriever

The format of our train and dev data (e.g. dl_10m.jsonl and nq-train.json) is shown below. Our implementation works with both json and jsonl files. For more details on the format, see here.

[
  {
	"question": "....",
	"positive_ctxs": [{"title": "...", "text": "...."}],
	"negative_ctxs": [{"title": "...", "text": "...."}],
	"hard_negative_ctxs": [{"title": "...", "text": "...."}]
  },
  ...
]
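The following sketch loads a file in this format and checks each sample for the fields shown above. It assumes that a .jsonl file stores one such object per line and that all four keys are present in every sample; both are assumptions, and load_samples and check_sample are hypothetical helpers rather than part of this repository.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("question", "positive_ctxs", "negative_ctxs", "hard_negative_ctxs")


def load_samples(path: str) -> List[Dict]:
    """Load samples from a .json file (one list) or a .jsonl file (one object per line)."""
    with open(path, "r", encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)


def check_sample(sample: Dict) -> None:
    """Raise ValueError if a sample lacks the expected keys or context fields."""
    missing = [key for key in REQUIRED_KEYS if key not in sample]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for key in REQUIRED_KEYS[1:]:
        for ctx in sample[key]:
            if "title" not in ctx or "text" not in ctx:
                raise ValueError(f"context in {key} needs 'title' and 'text'")


# In-memory example with the same shape as the format above.
example = {
    "question": "....",
    "positive_ctxs": [{"title": "...", "text": "...."}],
    "negative_ctxs": [],
    "hard_negative_ctxs": [{"title": "...", "text": "...."}],
}
check_sample(example)
print("sample ok")
```

To validate a whole file, call check_sample on every element returned by load_samples.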

Processed Wikipedia Graph

We also release our processed Wikipedia graph, which treats passages as nodes and hyperlinks as edges. Further details can be found in Section 3 of our paper. Click here to download.

import glob
import json

from tqdm import tqdm

# Glob pattern for the processed graph files; change this path accordingly.
PATH = "/home/data/jzhoubu/wiki_20210301_processed/**/wiki_**.json"
files = glob.glob(PATH)

# Map each passage key (e.g. 'Anarchism_0') to its text, mentions and linkouts.
title2info = {}
for path in tqdm(files):
    with open(path, "r") as f:
        title2info.update(json.load(f))

print(len(title2info))
# 22334994

print(title2info['Anarchism_0'])
# {'text': 
#    'Anarchism is a <SOE> political philosophy <EOE> and <SOE> movement <EOE> that is sceptical of <SOE> authority <EOE> and rejects all involuntary, coercive forms of <SOE> hierarchy <EOE> . Anarchism calls for the abolition of the <SOE> state <EOE> , which it holds to be undesirable, unnecessary, and harmful. It is usually described alongside <SOE> libertarian Marxism <EOE> as the libertarian wing ( <SOE> libertarian socialism <EOE> ) of the socialist movement and as having a historical association with <SOE> anti-capitalism <EOE> and <SOE> socialism <EOE> . The <SOE> history of anarchism <EOE> goes back to <SOE> prehistory <EOE> ,',
# 'mentions': 
#    ['political philosophy', 'movement', 'authority', 'hierarchy', 'state', 'libertarian Marxism', 'libertarian socialism', 'anti-capitalism', 'socialism', 'history of anarchism', 'prehistory'],
# 'linkouts': 
#    ['Political philosophy', 'Political movement', 'Authority', 'Hierarchy', 'State (polity)', 'Libertarian Marxism', 'Libertarian socialism', 'Anti-capitalism', 'Socialism', 'History of anarchism', 'Prehistory']
# }
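Judging from the record above, the i-th entry of mentions appears to be the anchor text for the i-th entry of linkouts, and each mention is wrapped in <SOE> ... <EOE> markers in the text. Under that assumption, the sketch below recovers the anchor texts from the markers and pairs them with their target titles. The record is an abbreviated copy of the one above, and mention_link_pairs is a hypothetical helper.

```python
import re
from typing import Dict, List, Tuple

# Anchor texts are delimited by <SOE> ... <EOE> markers in the passage text.
MENTION_RE = re.compile(r"<SOE> (.*?) <EOE>")


def mention_link_pairs(info: Dict) -> List[Tuple[str, str]]:
    """Pair each anchor text with the page title it links to (assumes aligned lists)."""
    if len(info["mentions"]) != len(info["linkouts"]):
        raise ValueError("mentions and linkouts are expected to have equal length")
    return list(zip(info["mentions"], info["linkouts"]))


# Abbreviated version of the 'Anarchism_0' record shown above.
record = {
    "text": "Anarchism is a <SOE> political philosophy <EOE> and <SOE> movement <EOE> "
            "that is sceptical of <SOE> authority <EOE>",
    "mentions": ["political philosophy", "movement", "authority"],
    "linkouts": ["Political philosophy", "Political movement", "Authority"],
}

print(MENTION_RE.findall(record["text"]))
print(mention_link_pairs(record))
```

The pairs show that an anchor text and its target title can differ, for example "movement" linking to "Political movement".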

Examples of HLP Q-P pairs

| Query | Passage |
|---|---|
| Title: Abby Kelley<br>Text: Liberty Farm in Worcester, Massachusetts, the home of Abby Kelley and Stephen Symonds Foster, was designated a National Historic Landmark because of its association with their lives of working for abolitionism. | Title: Worcester, Massachusetts<br>Text: Two of the nation’s most radical abolitionists, Abby Kelley Foster and her husband Stephen S. Foster, adopted Worcester as their home, as did Thomas Wentworth Higginson, the editor of The Atlantic Monthly and Emily Dickinson’s avuncular correspondent, and Unitarian minister Rev. Edward Everett Hale. The area was already home to Lucy Stone, Eli Thayer, and Samuel May Jr. They were joined in their political activities by networks of related Quaker families such as the Earles and the Chases, whose organizing efforts were crucial to ... |
| Title: Daniel Gormally<br>Text: In 2015 he tied for the second place with David Howell and Nicholas Pert in the 102nd British Championship and eventually finished fourth on tiebreak. | Title: Nicholas Pert<br>Text: In 2015, Pert tied for 2nd–4th with David Howell and Daniel Gormally, finishing third on tiebreak, in the British Chess Championship and later that year, he finished runner-up in the inaugural British Knockout Championship, which was held alongside the London Chess Classic. In this latter event, Pert, who replaced Nigel Short after his late withdrawal, eliminated Jonathan Hawkins in the quarterfinals and Luke McShane in the semifinals, then he lost to David Howell 4–6 in the final. |

Zero-shot Performance on BEIR Benchmark

NDCG@10 performance before and after tuning on the MSMARCO dataset

| Dataset | Before | After |
|---|---|---|
| ArguAna | 34.4 | 51.8 |
| climate-fever | 20.9 | 17.0 |
| DBPedia | 30.3 | 33.5 |
| FEVER | 64.1 | 68.9 |
| FiQA | 13.2 | 25.8 |
| HotpotQA | 55.0 | 55.0 |
| NFCorpus | 29.1 | 32.9 |
| NQ | 23.6 | 45.6 |
| SCIDOCS | 12.8 | 14.7 |
| SciFact | 60.7 | 53.7 |
| TREC-COVID | 36.3 | 63.1 |
| Touche-2020 | 8.9 | 21.8 |
| Avg | 32.4 | 40.3 |

MSMARCO-dev (MRR@10): 11.0
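As a quick arithmetic check, the Avg row can be reproduced as the unweighted mean over the 12 BEIR datasets. The sketch below copies the scores from the table and also lists the datasets where tuning lowered NDCG@10; the variable names are illustrative only.

```python
# NDCG@10 scores copied from the table above: dataset -> (before, after).
scores = {
    "ArguAna": (34.4, 51.8),
    "climate-fever": (20.9, 17.0),
    "DBPedia": (30.3, 33.5),
    "FEVER": (64.1, 68.9),
    "FiQA": (13.2, 25.8),
    "HotpotQA": (55.0, 55.0),
    "NFCorpus": (29.1, 32.9),
    "NQ": (23.6, 45.6),
    "SCIDOCS": (12.8, 14.7),
    "SciFact": (60.7, 53.7),
    "TREC-COVID": (36.3, 63.1),
    "Touche-2020": (8.9, 21.8),
}

# Unweighted mean over datasets, as in the Avg row.
before_avg = sum(before for before, _ in scores.values()) / len(scores)
after_avg = sum(after for _, after in scores.values()) / len(scores)
print(f"{before_avg:.1f} {after_avg:.1f}")  # 32.4 40.3

# Datasets where tuning on MSMARCO reduced the score.
dropped = [name for name, (before, after) in scores.items() if after < before]
print(dropped)  # ['climate-fever', 'SciFact']
```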

Citation

If you find this work useful, please cite the following paper:

@article{zhou2022hyperlink,
  title={Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering},
  author={Zhou, Jiawei and Li, Xiaoguang and Shang, Lifeng and Luo, Lan and Zhan, Ke and Hu, Enrui and Zhang, Xinyu and Jiang, Hao and Cao, Zhao and Yu, Fan and others},
  journal={arXiv preprint arXiv:2203.06942},
  year={2022}
}