The framework provides a simple yet effective toolkit for defining, training, and evaluating learned sparse retrieval methods. The framework is composed of standalone modules, allowing for easy mixing and matching of different modules or integration with your own implementation. This provides flexibility to experiment and customize the retrieval model to meet your specific needs.
The structure of the `lsr` package is as follows:
├── configs #configuration of different components
│ ├── dataset
│ ├── experiment #define exp details: dataset, loss, model, hp
│ ├── loss
│ ├── model
│ └── wandb
├── datasets #implementations of dataset loading & collator
├── losses #implementations of different losses + regularizer
├── models #implementations of different models
├── tokenizer #a wrapper of HF's tokenizers
├── trainer #trainer for training
└── utils #common utilities used in different places
- The list of all configurations used in the paper can be found below.
- The instructions for running experiments can be found below.
Create a `conda` environment:
conda create --name lsr python=3.9.12
conda activate lsr
Install dependencies with `pip`:
pip install -r requirements.txt
We have included all pre-defined dataset configurations under `lsr/configs/dataset`. Before starting training, ensure that you have the `ir_datasets` and (HuggingFace) `datasets` libraries installed, as the framework will automatically download the necessary data and store it in the correct directories.
For datasets from `ir_datasets`, the downloaded files are saved by default at `~/.ir_datasets/`. You can modify this path by changing the `IR_DATASETS_HOME` environment variable.
Similarly, for datasets from HuggingFace's `datasets`, the downloaded files are stored at `~/.cache/huggingface/datasets` by default. To specify a different cache directory, set the `HF_DATASETS_CACHE` environment variable.
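As a minimal sketch, both cache locations can also be set from Python rather than from the shell. The helper name `configure_dataset_caches` and the sub-directory layout are assumptions made for illustration; only the two environment variable names come from the text above. Setting the variables before importing `ir_datasets` or `datasets` is the safe choice, since libraries commonly read such settings when they are first loaded.

```python
import os
from pathlib import Path


def configure_dataset_caches(root: str) -> dict:
    """Point both dataset caches at sub-directories of `root`.

    Hypothetical helper: call it before importing `ir_datasets` or
    `datasets`, so that the libraries see the new locations.
    """
    root_path = Path(root).expanduser()
    # The sub-directory names below are illustrative, not required.
    paths = {
        "IR_DATASETS_HOME": root_path / "ir_datasets",
        "HF_DATASETS_CACHE": root_path / "huggingface" / "datasets",
    }
    for name, path in paths.items():
        path.mkdir(parents=True, exist_ok=True)
        os.environ[name] = str(path)
    return {name: str(path) for name, path in paths.items()}
```

Calling `configure_dataset_caches("/data/lsr_cache")` (a hypothetical path) at the top of a launcher script would redirect both downloads to a single disk.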
To train a custom model on your own dataset, please use the sample configurations under `lsr/configs/dataset` as templates. Overall, you need three important files (see `lsr/dataset_utils` for the file format):
- document collection: maps `document_id` to `document_text`
- queries: maps `query_id` to `query_text`
- train triplets or scored pairs:
  - train triplets, used for contrastive learning, contain a list of <`query_id`, `positive_document_id`, `negative_document_id`> triplets.
  - scored_pairs, used for distillation training, contain pairs of <`query`, `document_id`> with a relevance score.
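The three mappings can be illustrated with a toy dataset. This is only a sketch: the tab-separated layout and the file names `docs.tsv`, `queries.tsv`, and `triplets.tsv` are assumptions, and the format the framework actually expects is the one defined in `lsr/dataset_utils`.

```python
import csv
from pathlib import Path


def write_tsv(path, rows):
    """Write rows as tab-separated values (assumed layout, see lead-in)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def read_id_to_text(path):
    """Load a two-column TSV file into an {id: text} mapping."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0]: row[1] for row in csv.reader(f, delimiter="\t")}


def read_triplets(path):
    """Load <query_id, positive_document_id, negative_document_id> rows."""
    with open(path, newline="", encoding="utf-8") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]


def build_toy_dataset(out_dir):
    """Create the three files with a single training example."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    write_tsv(out / "docs.tsv", [
        ("d1", "sparse retrieval assigns weights to terms"),
        ("d2", "cats sleep for most of the day"),
    ])
    write_tsv(out / "queries.tsv", [("q1", "what is sparse retrieval")])
    # One triplet: query q1, positive document d1, negative document d2.
    write_tsv(out / "triplets.tsv", [("q1", "d1", "d2")])
    return out
```

Every identifier used in a triplet should resolve in the document and query mappings, which is a cheap check to run before starting a long training job.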
To train an LSR model, simply run the following command:
python -m lsr.train +experiment=sparta_msmarco_distil \
training_arguments.fp16=True
Please note that:
- In this command, `sparta_msmarco_distil` refers to the experiment configuration file located at `lsr/configs/experiment/sparta_msmarco_distil.yaml`. If you wish to use a different experiment, simply change this value to the name of the desired configuration file under `lsr/configs/experiment`.
- You may notice a `+` before `experiment=sparta_msmarco_distil`. This is a Hydra convention for adding a new configuration key (in this case, `experiment`) that is not yet defined in `lsr/configs/config.yaml`. If you want to override an existing key (e.g., `training_arguments.fp16`), you don't need the `+` symbol.
- We trained some models on NVIDIA A100 80GB GPUs, which allowed us to use large batch sizes (e.g., 128). To replicate our experiments on smaller GPUs, reduce the batch size and increase the gradient accumulation steps (e.g., add `training_arguments.per_device_train_batch_size=64 +training_arguments.gradient_accumulation_steps=2` to your training command). Note: For models that use sparse regularizers during training (e.g., Splade), the results may still differ slightly, since we don't take accumulation steps into account when adjusting regularization weights.
- We use `wandb` (by default) to monitor the training process, including loss, regularization, query length, and document length. To disable this feature, add `training_arguments.report_to='none'` to the above command. Alternatively, set up `wandb` by following its setup instructions.
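The trade-off between batch size and gradient accumulation in the note above is simple arithmetic, sketched here. The function names are hypothetical, and the `num_devices` parameter is an addition for multi-GPU setups that the text does not discuss.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int = 1,
                         num_devices: int = 1) -> int:
    """Number of examples contributing to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def accumulation_steps_for(target_batch_size: int,
                           per_device_batch_size: int,
                           num_devices: int = 1) -> int:
    """Accumulation steps needed to reach `target_batch_size` exactly."""
    per_step = per_device_batch_size * num_devices
    if target_batch_size % per_step != 0:
        raise ValueError(
            "target batch size must be a multiple of the per-step batch size"
        )
    return target_batch_size // per_step
```

With the values from the example command, `effective_batch_size(64, 2)` gives 128, matching the batch size used on the larger GPU. A GPU that only fits 32 examples per step would need `accumulation_steps_for(128, 32)`, i.e. 4 steps.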
When training has finished, you can use our inference scripts to generate new queries and documents as follows:
input_path=data/msmarco/dev_queries/raw.tsv
output_file_name=raw.tsv
batch_size=256
type='query'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
+experiment=sparta_msmarco_distil
input_path=data/msmarco/full_collection/split/part01
output_file_name=part01
batch_size=256
type='doc'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
inference_arguments.top_k=-400 \
+experiment=sparta_msmarco_distil
Note:
- The `top_k` argument is the number of terms you want to keep; a negative `top_k` means no pruning (all positive terms are kept). `scale_factor` is used for weight quantization; float weights are multiplied by this `scale_factor` and rounded to the nearest integer.
- Inference on the document collection takes a long time. Therefore, it is better to split the collection into multiple partitions and run inference on multiple GPUs.
- All the generated queries and documents are stored in the `output/{exp_name}/inference/` directory by default, where the `exp_name` parameter is defined in the experiment configuration file. You can change it as you like.
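The effect of `top_k` and `scale_factor` on a single query or document can be sketched as below. This is an illustration of the behaviour described in the notes, not the framework's own code: the function name is hypothetical, and treating `top_k=0` as "keep nothing" is an assumption, since the text only specifies the negative case.

```python
def prune_and_quantize(weights: dict, top_k: int = -1,
                       scale_factor: float = 100) -> dict:
    """Keep the highest-weighted positive terms and quantize their weights.

    weights: {term: float weight} for one query or document.
    top_k:   number of terms to keep; a negative value disables pruning.
    """
    # Only terms with a positive weight are candidates.
    positive = [(term, w) for term, w in weights.items() if w > 0]
    ranked = sorted(positive, key=lambda item: item[1], reverse=True)
    if top_k >= 0:
        ranked = ranked[:top_k]
    # Python's round() resolves exact .5 ties to the nearest even integer,
    # which may differ from the rounding used by the framework.
    return {term: int(round(w * scale_factor)) for term, w in ranked}
```

For example, with weights `{"hello": 0.031, "world": 1.237, "rare": 0.012, "noise": -0.5}`, calling the function with `top_k=2` keeps only `world` and `hello`, while `top_k=-400` keeps all three positive terms.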
We made simple changes to the indexing procedure in Anserini that improve the indexing speed (by 10x).
In the old method, Anserini first creates fake documents from JSON weight files (e.g., `{"hello": 3}`) by repeating each term according to its weight (e.g., `"hello hello hello"`) and then indexes these documents as regular documents. Creating these fake documents can substantially delay indexing for LSR, where the number of terms and their weights are usually large. To get rid of this issue, we leverage the `FeatureField` in Lucene to inject the (term, weight) pairs directly into the index. The change is simple but quite effective, especially when you have to index multiple times (as in the paper).
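To see why the fake-document step is costly, the following sketch mimics it in Python. It only illustrates the idea described above and is not Anserini's Java implementation; both function names are hypothetical.

```python
import json


def expand_to_fake_document(term_weights: dict) -> str:
    """Old approach: repeat each term `weight` times to encode its impact."""
    return " ".join(
        term
        for term, weight in term_weights.items()
        for _ in range(weight)
    )


def to_term_weight_pairs(term_weights: dict) -> list:
    """Direct approach: one (term, weight) pair per term, no repetition."""
    return list(term_weights.items())


if __name__ == "__main__":
    vector = json.loads('{"hello": 300, "world": 250}')
    fake_doc = expand_to_fake_document(vector)
    # The fake document grows with the sum of the weights (550 tokens),
    # while the direct representation grows with the number of terms (2).
    print(len(fake_doc.split()), len(to_term_weight_pairs(vector)))
```

The size of the fake document scales with the sum of the weights, so a larger quantization scale makes the old method slower even when the number of terms stays the same.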
Download the modified Anserini version, then follow the instructions in the README for installation. If the tests fail, you can skip them by adding `-Dmaven.test.skip=true`.
When the installation is done, you can continue with the next steps.
./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonSparseVectorCollection \
-input outputs/sparta_distil_sentence_transformers/inference/doc/ \
-index outputs/sparta_distil_sentence_transformers/index \
-generator SparseVectorDocumentGenerator \
-threads 60 -impact -pretokenized
Note that you have to change `sparta_distil_sentence_transformers` to the output defined in your experiment configuration file (here: `lsr/configs/experiment/sparta_msmarco_distil.yaml`).
./anserini-lsr/target/appassembler/bin/SearchCollection \
-index outputs/sparta_distil_sentence_transformers/index/ \
-topics outputs/sparta_distil_sentence_transformers/inference/query/raw.tsv \
-topicreader TsvString \
-output outputs/sparta_distil_sentence_transformers/run.trec \
-impact -pretokenized -hits 1000 -parallelism 60
Here, too, you may need to change the output directory, as in the indexing step.
ir_measures qrels.msmarco-passage.dev-subset.txt outputs/sparta_distil_sentence_transformers/run.trec MRR@10 R@1000 NDCG@10
`qrels.msmarco-passage.dev-subset.txt` is the qrels file for MSMARCO-dev in TREC format. You can find it on the MSMARCO or TREC DL (19, 20) website. Note that for TREC DL (19, 20), you have to change `R@1000` to `"R(rel=2)@1000"` (with the quotes).
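For a quick sanity check of a run file, MRR@10 can be computed in a few lines of plain Python. This is a sketch and not a replacement for `ir_measures`: it assumes the standard whitespace-separated TREC layouts (`qid iteration doc_id relevance` for qrels and `qid Q0 doc_id rank score tag` for runs) and averages over all queries in the qrels, a choice on which evaluation tools can differ. The `min_rel` parameter plays the same role as the `rel=2` setting mentioned above.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC qrels lines into {qid: {doc_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        if line.strip():
            qid, _, doc_id, rel = line.split()
            qrels[qid][doc_id] = int(rel)
    return dict(qrels)


def parse_run(lines):
    """Parse TREC run lines into {qid: [doc_id, ...]}, best score first."""
    scored = defaultdict(list)
    for line in lines:
        if line.strip():
            qid, _, doc_id, _, score, _ = line.split()
            scored[qid].append((float(score), doc_id))
    return {
        qid: [doc_id for _, doc_id in sorted(docs, key=lambda d: -d[0])]
        for qid, docs in scored.items()
    }


def mrr_at_k(qrels, run, k=10, min_rel=1):
    """Mean reciprocal rank of the first relevant document within top k."""
    if not qrels:
        return 0.0
    total = 0.0
    for qid, rels in qrels.items():
        for rank, doc_id in enumerate(run.get(qid, [])[:k], start=1):
            if rels.get(doc_id, 0) >= min_rel:
                total += 1.0 / rank
                break
    return total / len(qrels)
```

Passing the lines of the qrels file and of `run.trec` to the two parsers and then calling `mrr_at_k` should give a value close to the MRR@10 reported by `ir_measures`; small differences can come from tie-breaking and from which queries are included in the average.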
- RQ1: Are the results from recent LSR papers reproducible?
Results in Table 3 are the outputs of the following experiments:
Method | Configuration |
---|---|
DeepCT | lsr/configs/experiment/deepct_msmarco_term_level.yaml |
uniCOIL | lsr/configs/experiment/unicoil_msmarco_multiple_negative.yaml |
uniCOILdT5q | lsr/configs/experiment/unicoil_doct5query_msmarco_multiple_negative.yaml |
uniCOILtilde | lsr/configs/experiment/unicoil_tilde_msmarco_multiple_negative.yaml |
EPIC | lsr/configs/experiment/epic_original.yaml |
DeepImpact | lsr/configs/experiment/deep_impact_original.yaml |
TILDEv2 | lsr/configs/experiment/tildev2_msmarco_multiple_negative.yaml |
Sparta | lsr/configs/experiment/sparta_original.yaml |
Splademax | lsr/configs/experiment/splade_msmarco_multiple_negative.yaml |
distilSplademax | lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml |
- RQ2: How do LSR methods perform with recent advanced training techniques?
Results in Table 4 are the outputs of the following experiments:
Method | Configuration |
---|---|
uniCOIL | lsr/configs/experiment/unicoil_msmarco_distil.yaml |
uniCOILdT5q | lsr/configs/experiment/unicoil_doct5query_msmarco_distil.yaml |
uniCOILtilde | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
EPIC | lsr/configs/experiment/epic_msmarco_distil.yaml |
DeepImpact | lsr/configs/experiment/deep_impact_msmarco_distil.yaml |
TILDEv2 | lsr/configs/experiment/tildev2_msmarco_distil.yaml |
Sparta | lsr/configs/experiment/sparta_msmarco_distil.yaml |
distilSplademax | lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml |
distilSpladesep | lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
- RQ3: How does the choice of encoder architecture and regularization affect results?
Results in Table 5 are the outputs of the following experiments:
- MSMARCO Passage
Effect | Row | Configuration |
---|---|---|
Doc weighting | 1a | Before: lsr/configs/experiment/splade_asm_dbin_msmarco_distil.yaml; After: lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |
Doc weighting | 1b | Before: lsr/configs/experiment/unicoil_dbin_tilde_msmarco_distil.yaml; After: lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
Query weighting | 2a | Before: lsr/configs/experiment/tildev2_msmarco_distil.yaml; After: lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
Query weighting | 2b | Before: lsr/configs/experiment/epic_qbin_msmarco_distil.yaml; After: lsr/configs/experiment/epic_msmarco_distil.yaml |
Doc expansion | 3a | Before: lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml; After: lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
Doc expansion | 3b | Before: lsr/configs/experiment/unicoil_msmarco_distil.yaml; After: lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml |
Query expansion | 4a | Before: splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml; After: lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
Query expansion | 4b | Before: lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml; After: lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |
Regularization | 5a | Before: lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml; After: lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.00.yaml |
- Tripclick
Effect | Row | Configuration |
---|---|---|
Doc weighting | 1a | Before: lsr/configs/experiment/qmlp_dbin_tripclick_multiple_negative.yaml; After: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml |
Doc weighting | 1b | Before: lsr/configs/experiment/qmlp_dexpbin_tripclick_multiple_negative.yaml; After: lsr/configs/experiment/unicoil_tilde_tripclick_multiple_negative.yaml |
Query weighting | 2a | Before: lsr/configs/experiment/sparta_tripclick_multiple_negative.yaml; After: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_0.0_0.0.yaml |
Query weighting | 2b | Before: lsr/configs/experiment/qbin_dmlp_tripclick_multiple_negative.yaml; After: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml |
Doc expansion | 3a | Before: lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml; After: lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml |
Doc expansion | 3b | Before: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml; After: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml |
Query expansion | 4a | Before: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml; After: lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml |
Query expansion | 4b | Before: lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml; After: lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml |
Regularization | 5a | Before: lsr/configs/experiment/epic_tripclick_multiple_negative.yaml; After: lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml |
If you find this repository helpful, feel free to cite our paper "A Unified Framework for Learned Sparse Retrieval":
@inproceedings{nguyen2023unified,
title={A Unified Framework for Learned Sparse Retrieval},
author={Nguyen, Thong and MacAvaney, Sean and Yates, Andrew},
booktitle={Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2--6, 2023, Proceedings, Part III},
pages={101--116},
year={2023},
organization={Springer}
}