This repository contains the scripts for encoding the collection with a trained SPLADE model and running
TAR experiments with TARexp
To setup the environment, please install the requirement packages through pip install.
pip install -r requirements.txt
The splade_encode.py
script encodes the documents into scipy.sparse
matrix for TARexp
.
When encoding with multiple GPUs, please use DDP through torchrun
.
The following command is an example using naver/splade-cocondenser-ensembledistil
to encode the Jeb Bush collection.
torchrun --nproc_per_node=4 splade_encode.py \
--text ./jb_info/raw_text.pkl \
--model naver/splade-cocondenser-ensembledistil \
--topk 0.1 \
--output ./splade_jb/splade.10p.pkl
The --text
flags accept both pickle and csv files that can be read by pandas.read_csv
or pandas.read_pickle
into a
pandas.DataFrame
. The --text_column
indicates the column containing the text to be encoded (default raw
).
Please see python splade_encode.py --help
for more information on the arguments.
The exp.py
script execute the TAR experiments through TARexp
.
File jb_cates.txt
and rcv1_cates.py
specify the categories used in the paper.
The following example command runs an experiment on category 100
in the Jeb Bush collection
with the matrix decoded from the SPLADE model and a matrix with BM25 saturated tf without idf.
python exp.py \
--ranker lr \
--rel_info ./rcv1_info/rel_info.pkl \
--experiment jb_test \
--matrix ./splade_jb/splade.10p.pkl ./splade_jb/bm25woidf.pkl \
--exp_topic 100 \
--output_root ./tarexp_experiments_to_80
Please see python exp.py --help
for more information on the arguments.
Please consider citing the following paper.
@inproceedings{splade-tar,
author = {Eugene Yang},
title = {Contextualization with SPLADE for High Recall Retrieval},
booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (Short Paper)},
year = {2024},
doi = {10.1145/3626772.3657919}
}