-
clone this directory
git clone https://github.com/usnistgov/trec_eval
-
Install the relevant dependencies
pip install -r requirements.txt
-
In the data directory download the relevant data sets (further instructions can be found in the data directory) *Need to add corpora set up (MS MARCO, KILT, WaPo)
-
In the scripts directory run the 3 stage pipeline of query rewriting, BM25, and reranking
*Need to add BM25 implementation in src and create script for it -
Evaluation of overall system performance can be done in the eval folder *Not tested yet- need correct Doc-IDs to test the script
The main structure of the repository is as follows:
trec_cast_2022
└── src
│ └── modules
│ └── utils
└── scripts
│ └── run_rewrite
│ └── run_bm25
│ └── run_reranking
└── outputs
│ └── queries
│ └── bm25
│ └── reranking
└── eval
└── data
- src: all the python code for the 3 stages
- scripts: the scripts to process the data for each stage
- outputs: where the output files of each stage is saved
- data: where all the external data is stored (used by src)
- eval: the evaluation scripts, used to measure overall end-to-end performance
-
To load python3.8 into your session (to make the virtual environment) use the following command
load module python3.8
-
To get exclusive access to a GPU machine (documentation)
sintr -A GALES-SL4-GPU -p pascal -t 2:0:0 --exclusive
can also use
-p ampere
to request exclusive access to the newer A100 GPUs (but a bit amoral!) -
Can alternatively request exclusive access to a CPU
sintr -A GALES-SL4-CPU -p skylake -t 2:0:0 --exclusive
Other clusters can be requested in a similar way (e.g. with
-p cclake
)