Unified Learned Sparse Retrieval Framework

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

LSR: A unified framework for efficient and effective learned sparse retrieval

The framework provides a simple yet effective toolkit for defining, training, and evaluating learned sparse retrieval methods. The framework is composed of standalone modules, allowing for easy mixing and matching of different modules or integration with your own implementation. This provides flexibility to experiment and customize the retrieval model to meet your specific needs.

The structure of the lsr package is as follows:

├── configs  #configuration of different components
│   ├── dataset 
│   ├── experiment #define exp details: dataset, loss, model, hp 
│   ├── loss 
│   ├── model
│   └── wandb
├── datasets    #implementations of dataset loading & collator
├── losses  #implementations of different losses + regularizer
├── models  #implementations of different models
├── tokenizer   #a wrapper of HF's tokenizers
├── trainer     #trainer for training 
└── utils   #common utilities used in different places
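  • The list of all configurations used in the paper can be found in the "List of configurations used in the paper" section below.

  • The instructions for running experiments can be found in the "Training and inference instructions" section below.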

Training and inference instructions

1. Create conda environment and install dependencies:

Create the conda environment:

conda create --name lsr python=3.9.12
conda activate lsr

Install dependencies with pip

pip install -r requirements.txt

2. Download/Prepare datasets

We have included all pre-defined dataset configurations under lsr/configs/dataset. Before starting training, ensure that you have the ir_datasets and (huggingface) datasets libraries installed, as the framework will automatically download and store the necessary data to the correct directories.

For datasets from ir_datasets, the downloaded files are saved by default at ~/.ir_datasets/. You can modify this path by changing the IR_DATASETS_HOME environment variable.

Similarly, for datasets from HuggingFace's datasets library, the downloaded files are stored at ~/.cache/huggingface/datasets by default. To specify a different cache directory, set the HF_DATASETS_CACHE environment variable.
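
The two environment variables above can also be set from Python. The following is a minimal sketch; the helper name configure_dataset_caches and the cache location under the system temp directory are hypothetical and only serve as an illustration. Setting the variables before ir_datasets or datasets is imported is the safe order.

```python
import os
import tempfile
from pathlib import Path


def configure_dataset_caches(ir_datasets_home, hf_datasets_cache) -> None:
    """Point ir_datasets and HuggingFace datasets at custom cache directories."""
    for variable, directory in (
        ("IR_DATASETS_HOME", ir_datasets_home),
        ("HF_DATASETS_CACHE", hf_datasets_cache),
    ):
        path = Path(directory).expanduser()
        path.mkdir(parents=True, exist_ok=True)  # create the directory up front
        os.environ[variable] = str(path)


# Hypothetical cache root; replace it with a disk that has enough free space.
cache_root = Path(tempfile.gettempdir()) / "lsr_cache"
configure_dataset_caches(cache_root / "ir_datasets", cache_root / "hf_datasets")
print(os.environ["IR_DATASETS_HOME"])
print(os.environ["HF_DATASETS_CACHE"])
```

Exporting the same variables in the shell before launching training has the same effect.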

To train a custom model on your own dataset, please use the sample configurations under lsr/configs/dataset as templates. Overall, you need three important files (see lsr/dataset_utils for the file format):

  • document collection: maps document_id to document_text
  • queries: maps query_id to query_text
  • train triplets or scored pairs:
    • train triplets, used for contrastive learning, contains a list of <query_id, positive_document_id, negative_document_id> triplets.
    • scored pairs, used for distillation training, contain pairs of <query_id, document_id> with a relevance score.
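
To make the relationship between the three inputs concrete, here is a small in-memory sketch. It does not reproduce the on-disk file format (see lsr/dataset_utils for that); the toy data, the use of query ids in scored pairs, and the helper find_unknown_ids are assumptions made for illustration only.

```python
from typing import Dict, List, Set, Tuple

# Toy data standing in for the three inputs.
documents: Dict[str, str] = {
    "d1": "learned sparse retrieval expands documents with weighted terms",
    "d2": "dense retrieval encodes text into a single vector",
    "d3": "how to bake sourdough bread",
}
queries: Dict[str, str] = {"q1": "what is learned sparse retrieval"}

# <query_id, positive_document_id, negative_document_id> for contrastive learning.
train_triplets: List[Tuple[str, str, str]] = [("q1", "d1", "d3")]

# <query_id, document_id, score> for distillation training.
scored_pairs: List[Tuple[str, str, float]] = [
    ("q1", "d1", 9.2),
    ("q1", "d2", 3.1),
    ("q1", "d3", -4.7),
]


def find_unknown_ids(documents, queries, triplets, scored_pairs) -> Set[str]:
    """Return every id used in the training data that has no text attached."""
    unknown: Set[str] = set()
    for query_id, positive_id, negative_id in triplets:
        if query_id not in queries:
            unknown.add(query_id)
        unknown.update(d for d in (positive_id, negative_id) if d not in documents)
    for query_id, document_id, _score in scored_pairs:
        if query_id not in queries:
            unknown.add(query_id)
        if document_id not in documents:
            unknown.add(document_id)
    return unknown


print(find_unknown_ids(documents, queries, train_triplets, scored_pairs))
```

A check like this catches dangling ids before a long training run starts.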

3. Train a model

To train an LSR model, run the following command:

python -m lsr.train +experiment=sparta_msmarco_distil \
training_arguments.fp16=True 

Please note that:

  • In this command, sparta_msmarco_distil refers to the experiment configuration file located at lsr/configs/experiment/sparta_msmarco_distil.yaml. If you wish to use a different experiment, simply change this value to the name of the desired configuration file under lsr/configs/experiment.
  • You may notice a + before experiment=sparta_msmarco_distil. This is a convention in Hydra to add a new configuration key (in this case, experiment) that is not yet defined in lsr/configs/config.yaml. If you want to override an existing key (e.g., training_arguments.fp16), you don't need to use the + symbol.
  • We trained some models using NVIDIA A100 80GB, allowing us to use large batch sizes (e.g., 128). To replicate our experiments on smaller GPUs, reduce the batch size and increase the gradient accumulation steps (e.g., add training_arguments.per_device_train_batch_size=64 +training_arguments.gradient_accumulation_steps=2 to your training command). Note: With models (e.g., Splade) using sparse regularizers during training, the results may still differ slightly since we don't take accumulation steps into account for adjusting regularization weights.
  • We use wandb (by default) to monitor the training process, including loss, regularization, query length, and document length. If you wish to disable this feature, you can do so by adding training_arguments.report_to='none' to the above command. Alternatively, you can follow the instructions here to set up wandb.
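
The batch-size note above comes down to simple arithmetic: the effective batch size is the per-device batch size times the number of accumulation steps (times the number of devices). The sketch below checks that arithmetic and assembles the training command from the overrides quoted in this section; the helper names effective_batch_size and build_train_command are hypothetical and not part of the lsr package.

```python
from typing import List


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer step."""
    return per_device * accumulation_steps * num_devices


def build_train_command(experiment: str, overrides: List[str]) -> List[str]:
    """Assemble the argument list for `python -m lsr.train` with Hydra overrides."""
    return ["python", "-m", "lsr.train", f"+experiment={experiment}", *overrides]


overrides = [
    "training_arguments.fp16=True",                       # existing key: no "+"
    "training_arguments.per_device_train_batch_size=64",  # existing key: no "+"
    "+training_arguments.gradient_accumulation_steps=2",  # new key: needs "+"
]
command = build_train_command("sparta_msmarco_distil", overrides)
print(" ".join(command))
print(effective_batch_size(per_device=64, accumulation_steps=2))  # 128
```

The resulting list can be passed to subprocess.run(command, check=True) to launch training from a script.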

4. Run inference on MSMARCO dataset

When training has finished, you can use our inference scripts to generate new queries and documents as follows:

4.1 Generate queries

input_path=data/msmarco/dev_queries/raw.tsv
output_file_name=raw.tsv
batch_size=256
type='query'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
+experiment=sparta_msmarco_distil 

4.2 Generate documents

input_path=data/msmarco/full_collection/split/part01
output_file_name=part01
batch_size=256
type='doc'
python -m lsr.inference \
inference_arguments.input_path=$input_path \
inference_arguments.output_file=$output_file_name \
inference_arguments.type=$type \
inference_arguments.batch_size=$batch_size \
inference_arguments.scale_factor=100 \
inference_arguments.top_k=-400  \
+experiment=sparta_msmarco_distil

Note:

  • The top_k argument is the number of terms you want to keep; negative top_k means no pruning (all positive terms are kept).
  • scale_factor is used for weight quantization; float weights are multiplied by this scale_factor and rounded to the nearest integer.
  • Inference over the document collection takes a long time. Therefore, it is better to split the collection into multiple partitions and run inference on multiple GPUs.
  • All the generated queries and documents are stored in the output/{exp_name}/inference/ directory by default, where the exp_name parameter is defined in the experiment configuration file. You can change it as you like.
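
The effect of top_k and scale_factor described in the notes above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the code used by lsr.inference; the function name prune_and_quantize is hypothetical, and treating top_k=0 like a negative value is an assumption.

```python
from typing import Dict


def prune_and_quantize(weights: Dict[str, float], top_k: int = -1, scale_factor: int = 100) -> Dict[str, int]:
    """Keep positive terms, optionally only the top_k heaviest, and quantize the weights."""
    positive = {term: weight for term, weight in weights.items() if weight > 0}
    ranked = sorted(positive.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:  # negative top_k (and, by assumption, 0) means no pruning
        ranked = ranked[:top_k]
    # Multiply by scale_factor and round to the nearest integer.
    return {term: int(round(weight * scale_factor)) for term, weight in ranked}


weights = {"sparse": 1.53, "retrieval": 0.872, "the": 0.004, "dense": 0.0}
print(prune_and_quantize(weights, top_k=-1))  # {'sparse': 153, 'retrieval': 87, 'the': 0}
print(prune_and_quantize(weights, top_k=2))   # {'sparse': 153, 'retrieval': 87}
```

Note that a very small positive weight such as 0.004 rounds to 0 at scale_factor=100; whether such terms are dropped afterwards is not specified here.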

5. Index generated documents

5.1 Download and install our modified Anserini indexing software:

We made simple changes to the indexing procedure in Anserini to improve the indexing speed (by 10x). In the old method, Anserini first creates fake documents from JSON weight files (e.g., {"hello": 3}) by repeating each term according to its weight (e.g., "hello hello hello") and then indexes these documents as regular documents. Creating these fake documents can substantially delay indexing for LSR, where the number of terms and the weights are usually large. To avoid this issue, we leverage the FeatureField in Lucene to inject the (term, weight) pairs directly into the index. The change is simple but quite effective, especially when you have to index multiple times (as in the paper).
You can download the modified Anserini version here, then follow the instructions in the README for installation. If the tests fail, you can skip them by adding -Dmaven.test.skip=true.

When the installation is done, you can continue with the next steps.
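
To see why the old procedure described in 5.1 is costly, the following Python sketch mimics the fake-document expansion. It is only an illustration of the idea (the actual logic lives in Anserini's Java code); the function name to_fake_document and the example vector are hypothetical.

```python
from typing import Dict


def to_fake_document(term_weights: Dict[str, int]) -> str:
    """Old approach: repeat every term as many times as its integer weight."""
    return " ".join(
        " ".join([term] * weight) for term, weight in term_weights.items() if weight > 0
    )


vector = {"hello": 3, "world": 2}
fake_document = to_fake_document(vector)
print(fake_document)  # hello hello hello world world

# The fake document has one token per unit of weight, so its length is the sum
# of all weights, while the direct approach handles one (term, weight) pair per term.
print(len(fake_document.split()), "tokens vs", len(vector), "pairs")  # 5 tokens vs 2 pairs
```

With weights quantized at scale_factor=100 and documents expanded to hundreds of terms, the token count of a fake document grows into the thousands, which is the overhead the FeatureField-based indexing avoids.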

5.2 Index with Anserini

./anserini-lsr/target/appassembler/bin/IndexCollection \
-collection JsonSparseVectorCollection \
-input outputs/sparta_distil_sentence_transformers/inference/doc/  \
-index outputs/sparta_distil_sentence_transformers/index \
-generator SparseVectorDocumentGenerator \
-threads 60 -impact -pretokenized

Note that you have to change sparta_distil_sentence_transformers to the output defined in your experiment configuration file (here: lsr/configs/experiment/sparta_msmarco_distil.yaml).

6. Search on the Inverted Index

./anserini-lsr/target/appassembler/bin/SearchCollection \
-index outputs/sparta_distil_sentence_transformers/index/  \
-topics outputs/sparta_distil_sentence_transformers/inference/query/raw.tsv \
-topicreader TsvString \
-output outputs/sparta_distil_sentence_transformers/run.trec \
-impact -pretokenized -hits 1000 -parallelism 60

Here, you may need to change the output directory as in 5.2.

7. Evaluate the run file

ir_measures qrels.msmarco-passage.dev-subset.txt outputs/sparta_distil_sentence_transformers/run.trec MRR@10 R@1000 NDCG@10

qrels.msmarco-passage.dev-subset.txt is the qrels file for MSMARCO-dev in TREC format. You can find it on the MSMARCO or TREC DL (19, 20) website. Note that for TREC DL (19, 20), you have to change R@1000 to "R(rel=2)@1000" (with the quotes).
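
For readers who want to see what MRR@10 and R@1000 compute, and what the rel=2 threshold changes, here is a pure-Python sketch over TREC-format lines. It is an illustration with toy data, not a replacement for ir_measures; the function names are hypothetical, and the averaging conventions (for example, which queries are counted) may differ from those of ir_measures.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """TREC qrels line: query_id iteration doc_id relevance."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        if line.strip():
            query_id, _, doc_id, relevance = line.split()
            qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


def parse_run(lines: Iterable[str]) -> Dict[str, List[str]]:
    """TREC run line: query_id Q0 doc_id rank score tag. Returns doc ids by descending score."""
    scored = defaultdict(list)
    for line in lines:
        if line.strip():
            query_id, _, doc_id, _, score, _ = line.split()
            scored[query_id].append((float(score), doc_id))
    return {q: [d for _, d in sorted(pairs, key=lambda p: -p[0])] for q, pairs in scored.items()}


def mrr_at_k(qrels, run, k: int = 10, min_rel: int = 1) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for query_id, judged in qrels.items():
        for rank, doc_id in enumerate(run.get(query_id, [])[:k], start=1):
            if judged.get(doc_id, 0) >= min_rel:
                total += 1.0 / rank
                break
    return total / len(qrels)


def recall_at_k(qrels, run, k: int = 1000, min_rel: int = 1) -> float:
    """Mean fraction of relevant documents (relevance >= min_rel) found in the top k."""
    values = []
    for query_id, judged in qrels.items():
        relevant = {d for d, r in judged.items() if r >= min_rel}
        if relevant:  # queries without relevant documents are skipped
            retrieved = set(run.get(query_id, [])[:k])
            values.append(len(relevant & retrieved) / len(relevant))
    return sum(values) / len(values) if values else 0.0


qrels = parse_qrels(["q1 0 d1 1", "q1 0 d2 0", "q2 0 d5 2", "q2 0 d6 1"])
run = parse_run([
    "q1 Q0 d2 1 9.0 lsr", "q1 Q0 d1 2 8.0 lsr",
    "q2 Q0 d6 1 5.0 lsr", "q2 Q0 d7 2 4.0 lsr", "q2 Q0 d5 3 3.0 lsr",
])
print(mrr_at_k(qrels, run, k=10))               # 0.75
print(recall_at_k(qrels, run, k=2))             # 0.75
print(recall_at_k(qrels, run, k=2, min_rel=2))  # 0.0
```

Raising min_rel to 2 mirrors the R(rel=2)@1000 setting: only documents judged at least 2 count as relevant, which matters for the graded TREC DL judgments.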

List of configurations used in the paper

  • RQ1: Are the results from recent LSR papers reproducible?

Results in Table 3 are the outputs of the following experiments:

| Method | Configuration |
| --- | --- |
| DeepCT | lsr/configs/experiment/deepct_msmarco_term_level.yaml |
| uniCOIL | lsr/configs/experiment/unicoil_msmarco_multiple_negative.yaml |
| uniCOILdT5q | lsr/configs/experiment/unicoil_doct5query_msmarco_multiple_negative.yaml |
| uniCOILtilde | lsr/configs/experiment/unicoil_tilde_msmarco_multiple_negative.yaml |
| EPIC | lsr/configs/experiment/epic_original.yaml |
| DeepImpact | lsr/configs/experiment/deep_impact_original.yaml |
| TILDEv2 | lsr/configs/experiment/tildev2_msmarco_multiple_negative.yaml |
| Sparta | lsr/configs/experiment/sparta_original.yaml |
| Splademax | lsr/configs/experiment/splade_msmarco_multiple_negative.yaml |
| distilSplademax | lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml |
  • RQ2: How do LSR methods perform with recent advanced training techniques?

Results in Table 4 are the outputs of the following experiments:

| Method | Configuration |
| --- | --- |
| uniCOIL | lsr/configs/experiment/unicoil_msmarco_distil.yaml |
| uniCOILdT5q | lsr/configs/experiment/unicoil_doct5query_msmarco_distil.yaml |
| uniCOILtilde | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
| EPIC | lsr/configs/experiment/epic_msmarco_distil.yaml |
| DeepImpact | lsr/configs/experiment/deep_impact_msmarco_distil.yaml |
| TILDEv2 | lsr/configs/experiment/tildev2_msmarco_distil.yaml |
| Sparta | lsr/configs/experiment/sparta_msmarco_distil.yaml |
| distilSplademax | lsr/configs/experiment/splade_msmarco_distil_flops_0.1_0.08.yaml |
| distilSpladesep | lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
  • RQ3: How does the choice of encoder architecture and regularization affect results?

Results in Table 5 are the outputs of the following experiments:

  • MSMARCO Passage
| Effect | Row | Before | After |
| --- | --- | --- | --- |
| Doc weighting | 1a | lsr/configs/experiment/splade_asm_dbin_msmarco_distil.yaml | lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |
| Doc weighting | 1b | lsr/configs/experiment/unicoil_dbin_tilde_msmarco_distil.yaml | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
| Query weighting | 2a | lsr/configs/experiment/tildev2_msmarco_distil.yaml | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml |
| Query weighting | 2b | lsr/configs/experiment/epic_qbin_msmarco_distil.yaml | lsr/configs/experiment/epic_msmarco_distil.yaml |
| Doc expansion | 3a | lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml | lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
| Doc expansion | 3b | lsr/configs/experiment/unicoil_msmarco_distil.yaml | lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml |
| Query expansion | 4a | lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml | lsr/configs/experiment/splade_asm_msmarco_distil_flops_0.1_0.08.yaml |
| Query expansion | 4b | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml | lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |
| Regularization | 5a | lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml | lsr/configs/experiment/splade_asm_qmlp_msmarco_distil_flops_0.0_0.00.yaml |
  • Tripclick
| Effect | Row | Before | After |
| --- | --- | --- | --- |
| Doc weighting | 1a | lsr/configs/experiment/qmlp_dbin_tripclick_multiple_negative.yaml | lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml |
| Doc weighting | 1b | lsr/configs/experiment/qmlp_dexpbin_tripclick_multiple_negative.yaml | lsr/configs/experiment/unicoil_tilde_tripclick_multiple_negative.yaml |
| Query weighting | 2a | lsr/configs/experiment/sparta_tripclick_multiple_negative.yaml | lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_0.0_0.0.yaml |
| Query weighting | 2b | lsr/configs/experiment/qbin_dmlp_tripclick_multiple_negative.yaml | lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml |
| Doc expansion | 3a | lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml | lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml |
| Doc expansion | 3b | lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml | lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml |
| Query expansion | 4a | lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml | lsr/configs/experiment/splade_asm_tripclick_multiple_negative_l1_0.001_0.00001.yaml |
| Query expansion | 4b | lsr/configs/experiment/unicoil_tripclick_multiple_negative.yaml | lsr/configs/experiment/qmlm_dmlp_tripclick_hard_negative_l1_0.001.yaml |
| Regularization | 5a | lsr/configs/experiment/epic_tripclick_multiple_negative.yaml | lsr/configs/experiment/qmlp_dmlm_tripclick_hard_negative_l1_0.0_0.00001.yaml |

Citing and Authors

If you find this repository helpful, feel free to cite our paper, A Unified Framework for Learned Sparse Retrieval:

@inproceedings{nguyen2023unified,
  title={A Unified Framework for Learned Sparse Retrieval},
  author={Nguyen, Thong and MacAvaney, Sean and Yates, Andrew},
  booktitle={Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2--6, 2023, Proceedings, Part III},
  pages={101--116},
  year={2023},
  organization={Springer}
}