Relevance ranking for Ad-hoc Retrieval. This is a repository used to employ Machine Learning models on the TREC Web Track.

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

TREC WebTrack

This repository applies machine learning models to the ad-hoc task of the TREC Web Track. Issues, PRs, and suggestions are welcome.

More specifically, these are reranking models for query-document pairs. Because computing a relevance score for every query-document pair is too expensive, the objective is to rerank each year's QL submissions, which you can find in here.

These models order a list of text documents according to their relevance to a particular query. You can use this repository to train your own reranking models, or to apply a pre-trained one to custom data, i.e., a set of queries and documents.
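
To make the reranking idea concrete, here is a minimal sketch: an initial candidate list is rescored against the query and sorted best-first. The names rerank and term_overlap are illustrative only and are not part of this repository; term_overlap is a toy scorer standing in for a trained model.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score each candidate document against the query and sort best-first."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # sorted() is stable, so ties keep the order of the initial ranking.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def term_overlap(query: str, document: str) -> float:
    """Toy stand-in for a trained model: fraction of query terms found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


if __name__ == "__main__":
    # Hypothetical initial ranking, e.g. the top documents of a first-stage run.
    initial_ranking = [
        "stock market news for today",
        "how to train a neural ranking model",
        "neural networks for image classification",
    ]
    for doc, score in rerank("neural ranking model", initial_ranking, term_overlap):
        print("{:.2f}\t{}".format(score, doc))
```

Only the candidates in the initial list are scored, which is what keeps the cost manageable compared with scoring every document in the collection.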

Currently, there are 2 models implemented, described in:

Their implementation was adapted from the official release.

To install, run (Python 3.5+):

python setup.py develop

Create a symbolic link named DATA that points to your stored data:

ln -s path_to_your_data DATA

What you need under your DATA directory depends on whether you want to reproduce the TREC Web Track results or use a pre-trained model on your own data:

Reproduce TREC Web Track results

For now, I only provide scripts to reproduce the 2013 and 2014 results. However, you can adapt the bash and config files to run other configurations. Throughout the instructions, replace gpu_device with the CUDA_ID you want to run on (use None to run on CPU).

Download the official similarity matrices, provided by the authors, and extract them under DATA/corpora (the tar command below assumes simmat.tar.gz has been placed in that directory):

cd DATA
mkdir corpora
cd corpora
tar xvf simmat.tar.gz
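
For readers unfamiliar with this kind of input, a similarity matrix for a query-document pair is typically a grid with one row per query term and one column per document term. The sketch below assumes cosine similarity between word embeddings and assigns zero similarity to out-of-vocabulary terms; both choices are assumptions made for illustration, and the format of the downloaded files is not described here. The function names and the toy embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(query_terms: List[str],
                      doc_terms: List[str],
                      embeddings: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Return a len(query_terms) x len(doc_terms) grid of term similarities."""
    def sim(q: str, d: str) -> float:
        # Assumption: terms without an embedding contribute zero similarity.
        if q not in embeddings or d not in embeddings:
            return 0.0
        return cosine(embeddings[q], embeddings[d])

    return [[sim(q, d) for d in doc_terms] for q in query_terms]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, invented for this example.
    toy_embeddings = {
        "cheap": [1.0, 0.0],
        "flights": [0.0, 1.0],
        "airfare": [0.6, 0.8],
    }
    matrix = similarity_matrix(["cheap", "flights"],
                               ["airfare", "flights", "hotel"],
                               toy_embeddings)
    for row in matrix:
        print(" ".join("{:.2f}".format(value) for value in row))
```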

Also download the query IDF vectors into DATA/corpora and extract them:

cd DATA/corpora/
unzip query_idf.zip

Now you can either train and test or test only the PACRR model.

  • For train and test, run any script under bin/test13 or bin/test14, as in the following example (1* stands for the year digits in the directory and script names):

    bash bin/test1*/run_pacrr_1*val.sh gpu_device
    

    or, to run using a round-robin procedure (will take longer):

    bash bin/test1*/run_pacrr_test1*.sh gpu_device
    
  • For test only, you'll need to download my weight files and extract them under DATA.

    cd DATA
    unzip model_outputs.zip
    

    Then comment out the part of the bin/test1*/ bash files that calls script/train.py, and run them the same way as described for train and test.

Using a pre-trained model on your data

You'll need to download my weight files and extract them under DATA:

  cd DATA
  unzip model_outputs.zip

and the pretrained embeddings:

  cd DATA
  unzip embeddings.zip

Next, edit the file qrels/customdata.txt to match your data. As the example file I provide shows, it uses the following format:

query text
document text

(...)

query text
document text
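
A minimal sketch of generating a file in this format from Python. It assumes that each query and its document sit on consecutive single lines, with no separator between pairs; the example file shipped in qrels/ is the authority on blank lines or other separators. format_pairs and write_pairs are hypothetical helpers, not part of this repository.

```python
from typing import Iterable, List, Tuple


def format_pairs(pairs: Iterable[Tuple[str, str]]) -> str:
    """Render (query, document) pairs as text: query line, then document line."""
    lines = []  # type: List[str]
    for query, document in pairs:
        # Collapse runs of whitespace (including newlines) so each text stays on one line.
        lines.append(" ".join(query.split()))
        lines.append(" ".join(document.split()))
    return "\n".join(lines) + "\n" if lines else ""


def write_pairs(path: str, pairs: Iterable[Tuple[str, str]]) -> None:
    """Write the formatted pairs to path, overwriting any existing file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_pairs(pairs))


if __name__ == "__main__":
    # Invented example data; print instead of writing so nothing is overwritten.
    example_pairs = [
        ("cheap flights", "Compare airfare deals\nfrom many airlines."),
        ("cheap flights", "A history of the steam engine."),
    ]
    print(format_pairs(example_pairs), end="")
```

To produce the real input, call write_pairs("qrels/customdata.txt", pairs) from the repository root, after checking the assumed layout against the example file.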

Replace the contents of this file with your queries and documents, then run:

bash bin/run_pacrr_customdata.sh

When it finishes, the script prints a message saying where the test.probs file was saved. That file contains the relevance scores of every query-document pair listed in qrels/customdata.txt.