Source code of SIGIR2022 Long Paper:
Webformer: Pre-training with Web Pages for Information Retrieval
Quick access to the PDF WebFormer Paper.
This is a fork of the original WebFormer repository. Original Git WebFormer.
Prepare a Python3 environment.
Since WebFormer is a 2022 paper, its dependencies are outdated, and installing them from requirements.txt causes errors:
# Original installation (error-prone)
cd Webformer
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
SUGGESTED DEPENDENCIES INSTALLATION METHOD
Run requirements.ipynb to check which libraries are already installed, then manually install the missing ones via pip, excluding pytrec-eval.
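The check described above can also be sketched outside the notebook. This is a minimal illustration, not the contents of requirements.ipynb: the helper names `requirement_name` and `missing_packages` are hypothetical, and it assumes requirements.txt uses plain `name==version` style lines.

```python
import re
from importlib import metadata


def requirement_name(line):
    """Return the bare package name from one requirements.txt line, or None."""
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("-"):
        return None  # blank line, comment, or a pip option such as "-i <url>"
    # Cut at the first version specifier, extra, marker, or whitespace.
    return re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]


def missing_packages(requirement_lines, skip=("pytrec-eval",)):
    """List required package names that are not installed, ignoring `skip`."""
    def norm(name):
        # Treat "-", "_" and "." as equivalent, as pip does for project names.
        return re.sub(r"[-_.]+", "-", name).lower()

    skipped = {norm(s) for s in skip}
    missing = []
    for line in requirement_lines:
        name = requirement_name(line)
        if name is None or norm(name) in skipped:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Example use (assumes requirements.txt is in the current directory):
# with open("requirements.txt", encoding="utf-8") as fh:
#     for name in missing_packages(fh):
#         print(f"pip install {name}")
```

The sketch only reports whether a package is present; it does not compare installed versions against the pinned ones, since the pinned versions are the ones that fail to install.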
Finally, run the bert_base_uncased downloader cell to download and store BERT on your local machine.
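For readers who want to see roughly what such a downloader cell might do, here is a hedged sketch. It is not copied from requirements.ipynb: it assumes the Hugging Face `transformers` package is installed, and the function names and the default folder name `bert_base_uncased` are hypothetical.

```python
from pathlib import Path


def bert_is_stored(local_dir):
    """True if `local_dir` already holds a saved model (config.json present)."""
    return (Path(local_dir) / "config.json").is_file()


def download_bert(local_dir="bert_base_uncased", model_name="bert-base-uncased"):
    """Download tokenizer and weights once, then save them under `local_dir`."""
    if bert_is_stored(local_dir):
        return Path(local_dir)  # nothing to do, reuse the local copy
    # Imported lazily so the check above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    AutoTokenizer.from_pretrained(model_name).save_pretrained(local_dir)
    AutoModel.from_pretrained(model_name).save_pretrained(local_dir)
    return Path(local_dir)
```

Calling `download_bert()` a second time returns immediately, so the download only happens on the first run.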
Download datasets:
- SWDE from Academic Torrents.
- Common Crawl from their link.
Create the data folder by running the cell inside requirements.ipynb.
All downloaded datasets should be placed inside a folder corresponding to their topic within the Preprocess/data/endata directory.
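The expected layout can be illustrated with a short sketch. The helper `make_topic_dirs` and the topic names in the usage comment are hypothetical; use whatever topics your downloaded data actually covers.

```python
from pathlib import Path


def make_topic_dirs(topics, root="Preprocess/data/endata"):
    """Create one sub-folder per topic under `root` and return the paths."""
    root = Path(root)
    created = []
    for topic in topics:
        folder = root / topic
        # parents=True also creates Preprocess/data/endata if it is missing.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Example use with made-up topic names:
# make_topic_dirs(["movie", "book"])
# -> Preprocess/data/endata/movie and Preprocess/data/endata/book
```

Running it twice is harmless, because existing folders are left untouched.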
Each corpus item is the raw HTML of a web page. Run the following command to strip irrelevant content and produce the training corpus:
python Preprocess/html2json.py
Remember to set your data path in the code.
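To give a feel for the kind of cleaning this step performs, the sketch below drops script and style content from raw HTML and keeps the visible text as a JSON-serializable record. It is an illustration only, not the logic of Preprocess/html2json.py: the class `VisibleTextParser`, the function `html_to_record`, and the `url`/`text` record fields are all assumptions, and the real script may preserve far more of the page structure.

```python
import json
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not inside script, style, or noscript elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_record(html, url=""):
    """Return a JSON-serializable dict with the visible text chunks of a page."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "text": parser.chunks}


# Example use (hypothetical file name):
# with open("page.html", encoding="utf-8") as fh:
#     print(json.dumps(html_to_record(fh.read(), url="page.html")))
```

Only the standard library is used here, so the sketch runs even when the project's outdated dependencies are not installed.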
Use the JSON file produced in the previous step to generate the training data.
bash construct_data.sh
bash train.sh
@inproceedings{DBLP:conf/sigir/GuoMMQZJCD22,
title = {Webformer: Pre-training with Web Pages for Information Retrieval},
url = {https://doi.org/10.1145/3477495.3532086},
}