Webformer [Fork Late 2023]

Source code of the SIGIR 2022 long paper:

Webformer: Pre-training with Web Pages for Information Retrieval

Quick access to the Webformer paper (PDF).

Pipeline

0. Preliminary process

This is a fork of the original Webformer repository, which accompanies the paper. Original Git repository: Webformer.

1. Preinstallation

Prepare a Python3 environment.

Since Webformer is a 2022 paper, its dependencies are outdated, and installing them from requirements.txt causes errors:

# Original installation (error-prone)
cd Webformer
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

SUGGESTED DEPENDENCIES INSTALLATION METHOD

Run requirements.ipynb to check which libraries are already installed, then install the missing ones manually with pip, excluding pytrec-eval.
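The check described above can be sketched outside the notebook as follows. This is a minimal illustration, not the content of requirements.ipynb: the function names `requirement_names` and `find_missing` are hypothetical, and the parsing assumes a simple requirements.txt with one package per line.

```python
import re
from importlib import metadata


def requirement_names(lines, exclude=("pytrec-eval", "pytrec_eval")):
    """Extract bare package names from requirements-style lines.

    Comments, blank lines, and pip options (lines starting with '-')
    are skipped. Names listed in `exclude` are dropped.
    """
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        # Cut the line at the first version specifier, extra, or marker.
        name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
        if name and name not in exclude:
            names.append(name)
    return names


def find_missing(packages):
    """Return the packages that have no installed distribution."""
    missing = []
    for name in packages:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed path: requirements.txt in the current directory.
    try:
        with open("requirements.txt", encoding="utf-8") as handle:
            wanted = requirement_names(handle)
    except FileNotFoundError:
        wanted = []
    for name in find_missing(wanted):
        print(f"missing: pip install {name}")
```

Printing the pip commands instead of running them keeps the installation manual, as the step above suggests, so each package version can be chosen by hand.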

Finally, run the bert_base_uncased downloader cell to download BERT and store it on your local machine.
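For readers who prefer a script to the notebook cell, the download step might look like the sketch below. It assumes the Hugging Face `transformers` library is installed and that the model is the public `bert-base-uncased` checkpoint; the function name `download_bert` and the target directory are hypothetical, and the notebook cell may do this differently.

```python
def download_bert(target_dir="bert_base_uncased", model_name="bert-base-uncased"):
    """Download a BERT checkpoint and save it to a local directory.

    `target_dir` is an assumed location; point it to wherever the
    training scripts expect the model to be.
    """
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    tokenizer.save_pretrained(target_dir)
    model.save_pretrained(target_dir)
    return target_dir
```

Calling `download_bert()` once requires network access; afterwards the saved directory can be passed to `from_pretrained` in place of the model name.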

2. Get Corpus Data

Download datasets:

SWDE from Academic Torrents.
Common Crawl from their link.

Create the data folder by running the corresponding cell in requirements.ipynb.

All the downloaded datasets should be placed inside a folder corresponding to their topic within the Preprocess/data/endata directory.
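The expected layout can be illustrated with a short sketch. The helper `make_data_dirs` and the topic names are hypothetical, and the notebook cell may create the folders differently; only the Preprocess/data/endata path comes from the step above.

```python
from pathlib import Path


def make_data_dirs(root, topics):
    """Create one folder per topic under <root>/Preprocess/data/endata."""
    base = Path(root) / "Preprocess" / "data" / "endata"
    created = []
    for topic in topics:
        topic_dir = base / topic
        # parents=True builds the intermediate folders; exist_ok=True
        # makes the call safe to repeat.
        topic_dir.mkdir(parents=True, exist_ok=True)
        created.append(topic_dir)
    return created

# Example call from the repository root, with illustrative topic names:
# make_data_dirs(".", ["auto", "book"])
```

Each downloaded dataset then goes into the folder that matches its topic.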

3. Prepare the Corpus Data

Every piece of corpus data is the raw HTML code of a web page. Run the following command to strip irrelevant content and build the training corpus:

  python Preprocess/html2json.py

Remember to set your data path in the code.
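To give an idea of what stripping irrelevant content from raw HTML involves, here is a minimal sketch using only the Python standard library. It is not the repository's html2json.py: the class `TextExtractor`, the function `html_to_record`, and the `url`/`text` record fields are assumptions made for illustration, and the real script's output schema may differ.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script, style, and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_record(html, url=""):
    """Turn one raw HTML page into a JSON-serializable record."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return {"url": url, "text": " ".join(parser.chunks)}


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><script>var a = 1;</script><p>Hello</p></body></html>"
    print(json.dumps(html_to_record(page, url="http://example.com")))
```

This sketch keeps only flat text. A pre-training corpus that relies on page structure would also need to record the tag hierarchy, which is left out here.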

4. Prepare the Training Data

Use the JSON file produced in the previous step to generate the training data.

  bash construct_data.sh

5. Running Pre-training

  bash train.sh

Citations

@inproceedings{DBLP:conf/sigir/GuoMMQZJCD22,
  title     = {Webformer: Pre-training with Web Pages for Information Retrieval},
  url       = {https://doi.org/10.1145/3477495.3532086},
}

ardizio/Webformer