This repository contains the code and resources for our proposed pre-retrieval Query Performance Prediction (QPP) method, which leverages a nearest-neighbors retrieval strategy to predict the performance of an input query. To this end, we maintain a Querystore in which queries with known performance are indexed; at runtime, the nearest neighbors of the input query are sampled from it. The performance of the sampled queries is then used to estimate the likely performance of the new query. The framework of our proposed Nearest Neighbor QPP (NN-QPP) method is shown below:
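In code, the core idea amounts to a nearest-neighbor lookup over query embeddings followed by averaging the neighbors' known performance. Below is a minimal sketch with toy 2-d embeddings and hypothetical MAP values; the actual system uses sentence-transformer embeddings and a FAISS index, as described in the usage steps.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nn_qpp_predict(query_emb, store, k=3):
    """Predict the performance of a query as the mean performance of its
    k nearest neighbors in the Querystore.

    store: list of (embedding, performance) pairs for queries whose
           retrieval effectiveness is already known.
    """
    ranked = sorted(store, key=lambda item: cosine(query_emb, item[0]), reverse=True)
    top_k = ranked[:k]
    return sum(perf for _, perf in top_k) / len(top_k)

# Toy Querystore: 2-d embeddings with hypothetical MAP values
store = [([1.0, 0.0], 0.40), ([0.9, 0.1], 0.30), ([0.0, 1.0], 0.05)]
print(nn_qpp_predict([1.0, 0.05], store, k=2))  # averages the two closest queries
```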
The table below reports the Pearson Rho (P), Kendall Tau (K), and Spearman (S) correlations of several baselines as well as our proposed NN-QPP method over four datasets: MS MARCO Dev small (6980 queries), TREC DL 2019 (43 queries), TREC DL 2020 (53 queries), and DL-Hard (50 queries).

QPP Method | Dev P | Dev K | Dev S | DL19 P | DL19 K | DL19 S | DL20 P | DL20 K | DL20 S | DL-Hard P | DL-Hard K | DL-Hard S |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SCS | 0.021 | 0.058 | 0.085 | 0.471 | 0.262 | 0.354 | 0.447 | 0.310 | 0.448 | 0.247 | 0.159 | 0.240 |
P_Clarity | 0.052 | 0.007 | 0.009 | 0.109 | 0.119 | 0.139 | 0.069 | 0.052 | 0.063 | 0.095 | 0.209 | 0.272 |
VAR | 0.067 | 0.081 | 0.119 | 0.290 | 0.141 | 0.187 | 0.047 | 0.051 | 0.063 | 0.023 | 0.014 | 0.001 |
PMI | 0.030 | 0.033 | 0.048 | 0.155 | 0.065 | 0.079 | 0.021 | 0.012 | 0.003 | 0.093 | 0.027 | 0.042 |
IDF | 0.117 | 0.138 | 0.200 | 0.440 | 0.276 | 0.389 | 0.413 | 0.236 | 0.345 | 0.200 | 0.197 | 0.275 |
SCQ | 0.029 | 0.022 | 0.032 | 0.395 | 0.114 | 0.157 | 0.193 | 0.005 | 0.004 | 0.335 | 0.106 | 0.152 |
ICTF | 0.105 | 0.136 | 0.198 | 0.435 | 0.259 | 0.365 | 0.409 | 0.236 | 0.348 | 0.192 | 0.195 | 0.272 |
DC | 0.071 | 0.044 | 0.065 | 0.132 | 0.083 | 0.092 | 0.100 | 0.118 | 0.149 | 0.155 | 0.091 | 0.115 |
CC | 0.085 | 0.066 | 0.076 | 0.079 | 0.068 | 0.023 | 0.172 | 0.065 | 0.089 | 0.155 | 0.093 | 0.111 |
IEF | 0.110 | 0.090 | 0.118 | 0.140 | 0.090 | 0.134 | 0.110 | 0.025 | 0.037 | 0.018 | 0.071 | 0.139 |
MRL | 0.022 | 0.046 | 0.067 | 0.176 | 0.079 | 0.140 | 0.093 | 0.078 | 0.117 | -0.046 | 0.052 | 0.038 |
NN-QPP | 0.219 | 0.214 | 0.309 | 0.483 | 0.349 | 0.508 | 0.452 | 0.319 | 0.457 | 0.364 | 0.234 | 0.340 |
The performance of NN-QPP may be affected by (1) the base language model used to create the Querystore, (2) the number of nearest-neighbor samples retrieved per query at inference time, and (3) the size of the Querystore used for finding the nearest-neighbor samples. We therefore investigate the impact of each on the overall performance of the model. To this end, we adopt three pre-trained sentence embedding models, namely (1) all-mpnet-base-v2, (2) all-MiniLM-L6-v2, and (3) paraphrase-MiniLM-v2, develop the Querystore independently for each, and measure the performance of NN-QPP. In addition, we sample queries from the Querystore with k = {1, 3, 5, 7, 9, 10} over all four datasets. The figures report performance based on Kendall Tau, Pearson Rho, and Spearman correlations.
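The sweep over k can be sketched as follows: for each k, average the top-k neighbor performances per query, then correlate the predictions with the actual performance. The neighbor lists and actual values below are hypothetical toy numbers, and the Kendall Tau implementation is a plain tau-a (no tie correction), just to illustrate the procedure.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    # Kendall's tau-a: (concordant - discordant) / total pairs, ignoring ties
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical neighbor performances per query, sorted by similarity
neighbours = {
    "q1": [0.50, 0.42, 0.38, 0.10, 0.05],
    "q2": [0.05, 0.08, 0.20, 0.30, 0.01],
    "q3": [0.33, 0.35, 0.30, 0.28, 0.31],
}
actual = {"q1": 0.45, "q2": 0.10, "q3": 0.30}

for k in (1, 3, 5):
    predicted = [sum(v[:k]) / k for v in neighbours.values()]
    tau = kendall_tau(predicted, [actual[q] for q in neighbours])
    print(f"k={k}: tau={tau:.2f}")
```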
In addition, we explore the impact of Querystore size on the performance of NN-QPP. To do so, we randomly sample various percentages of queries from the pool of 500k MS MARCO queries. For each subset, we construct a distinct version of the Querystore using the paraphrase-MiniLM-v2 language model. We then evaluate NN-QPP on the MS MARCO Dev query set, using the top-10 nearest neighbors sampled from each Querystore. The outcomes of these evaluations are presented in the table below.

Percentage of Queries | Pearson | Kendall | Spearman |
---|---|---|---|
50% | 0.200 | 0.191 | 0.278 |
60% | 0.200 | 0.197 | 0.286 |
70% | 0.196 | 0.199 | 0.290 |
80% | 0.216 | 0.209 | 0.302 |
90% | 0.215 | 0.207 | 0.299 |
100% | 0.219 | 0.214 | 0.309 |
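The subsampling step above can be sketched as a simple random draw from the query pool; `sample_querystore` is a hypothetical helper, and the 1000-query pool is a stand-in for the 500k MS MARCO queries.

```python
import random

def sample_querystore(queries, fraction, seed=42):
    """Randomly sample a fraction of the query pool to build a reduced
    Querystore (e.g., fraction=0.5 for the 50% setting)."""
    random.seed(seed)  # fixed seed so subsets are reproducible
    n = round(len(queries) * fraction)
    return random.sample(queries, n)

pool = [f"q{i}" for i in range(1000)]  # stand-in for the 500k MS MARCO queries
subset = sample_querystore(pool, 0.5)
print(len(subset))  # 500
```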
1- First, calculate the performance of the QueryStore queries using QueryStorePerformanceCalculator.py. This script receives a set of queries and calculates their performance (i.e., MAP@1000) through the Anserini toolkit.
```
python QueryStorePerformanceCalculator.py \
  -queries <path to queries (TSV format)> \
  -anserini <path to Anserini> \
  -index <path to collection index> \
  -qrels <path to qrels> \
  -nproc <number of CPUs> \
  -experiment_dir <experiment folder> \
  -queries_chunk_size <chunk size to split queries> \
  -hits <number of docs to retrieve and calculate performance on>
```
The MAP@1000 scores of the MS MARCO queries that were used to build the QueryStore are provided as a pickle file named QueryStore_queries_MAP@1000.pkl.
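The per-query metric computed in this step can be illustrated in isolation: average precision of one ranked list at a cutoff (MAP@1000 is then the mean of this value over the query set). This is a textbook sketch, not the Anserini implementation.

```python
def average_precision(ranked_doc_ids, relevant_ids, cutoff=1000):
    """Average precision of a single ranked list at the given cutoff.
    MAP@1000 over a query set is the mean of this value per query."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / len(relevant_ids)

# Relevant docs d1 and d3 found at ranks 1 and 3
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```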
2- In order to find and retrieve the most similar queries from the QueryStore during inference, we first need to index the QueryStore queries. This can be done using encode_queries.py as below:
```
python encode_queries.py \
  -model <model used to create embeddings (e.g., sentence-transformers/all-MiniLM-L6-v2)> \
  -queries <path to queries to index (TSV format)> \
  -output <path to output folder>
```
3- During inference, we can find the top-k most similar queries to a set of target queries from the QueryStore using the find_most_similar_queries.py script as below:
```
python find_most_similar_queries.py \
  -model <model used to create embeddings for target queries (e.g., sentence-transformers/all-MiniLM-L6-v2)> \
  -faiss_index <path to the index of QueryStore queries> \
  -target_queries_path <path to target queries> \
  -hits <number of top-k most similar queries to be selected>
```
4- Finally, having the top-k most similar queries for each target query, we can estimate its performance by averaging the performance of the retrieved queries, using query_performance_predictor.py as follows:
```
python query_performance_predictor.py \
  -top_matched_queries <path to top-k matched queries from the QueryStore for target queries> \
  -QueryStore_queries <path to QueryStore queries (TSV format)> \
  -QueryStore_queries_performance <path to the pickle file containing the MAP@1000 of QueryStore queries (QueryStore_queries_MAP@1000.pkl)> \
  -output <path to output>
```
Post-Retrieval NN-QPP: Estimating Query Performance Through Rich Contextualized Query Representations
State-of-the-art query performance prediction methods rely on fine-tuning contextual language models to estimate retrieval effectiveness on a per-query basis. Our work builds on this strong foundation and proposes to learn rich query representations by modeling the interactions between the query and two important sources of contextual information, namely the set of documents retrieved by that query and the set of similar historical queries with known retrieval effectiveness. We propose that such contextualized query representations can be more accurate estimators of query performance, as they embed the performance of past similar queries and the semantics of the documents retrieved by the query. We perform extensive experiments on the MS MARCO collection and its accompanying query sets, including the MS MARCO Dev set, the TREC Deep Learning tracks of 2019, 2020, and 2021, and DL-Hard. Our experiments reveal that our proposed method shows robust and effective performance compared to state-of-the-art baselines.
First, you need to clone the repository:

```
git clone https://github.com/sadjadeb/nearest-neighbour-qpp.git
```
Then, you need to create a virtual environment and install the requirements:
```
cd nearest-neighbour-qpp/
sudo apt-get install virtualenv
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```
Then, you need to download the data:
```
bash download_data.sh
```
To create a dictionary which maps each query to its actual performance with BM25 (i.e., MRR@10), you need to run the following command:

```
python extract_metrics_per_query.py --run /path/to/run/file --qrels /path/to/qrels/file
```
It will create a file named `run-file-name_evaluation-per-query.json` in the `data/eval_per_query` directory.
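The target metric here, MRR@10, can be sketched in a few lines: the reciprocal rank of the first relevant document within the top 10, averaged over queries. This is an illustrative implementation with toy run/qrels data, not the script's actual code.

```python
def mrr_at_10(run, qrels):
    """Mean reciprocal rank at cutoff 10.
    run:   {qid: [doc_id, ...]} ranked docs per query
    qrels: {qid: {relevant doc_ids}}
    """
    total = 0.0
    for qid, ranked in run.items():
        rr = 0.0
        for rank, doc in enumerate(ranked[:10], start=1):
            if doc in qrels.get(qid, set()):
                rr = 1.0 / rank  # first relevant hit determines the score
                break
        total += rr
    return total / len(run)

run = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"]}
qrels = {"q1": {"d1"}, "q2": {"d7"}}
print(mrr_at_10(run, qrels))  # (1/2 + 0) / 2 = 0.25
```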
Then you need to create a file which contains, for each query, the most similar query from the train set (a.k.a. historical queries with known retrieval effectiveness). To do so, you need to run the following command:

```
python find_most_similar_query.py --base_queries /path/to/train-set/queries --target_queries /path/to/desired/queries --model_name /name/of/the/language/model --hits /number/of/hits
```
Finally, to gather all the data into a single file for easier loading, you need to run the following commands:

```
python create_train_pkl_file.py
python create_test_pkl_file.py
```
To train the model, you need to run the following command:
```
python train.py
```
You can change the hyperparameters of the model by editing lines 9-12 of the `train.py` file.
To test the model, you need to run the following command:
```
python test.py
```
To evaluate the model, you need to run the following command:
```
python evaluation.py --actual /path/to/actual/performance/file --predicted /path/to/predicted/performance/file --target_metric /target/metric
```
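The evaluation compares predicted against actual per-query performance with correlation coefficients. As a minimal sketch (not the script itself), Pearson's r between two score lists can be computed as follows; the `actual`/`predicted` values are hypothetical.

```python
import math

def pearson(xs, ys):
    # Pearson correlation between predicted and actual performance scores
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

actual = [0.45, 0.10, 0.30, 0.25]     # hypothetical per-query MRR@10
predicted = [0.40, 0.12, 0.28, 0.30]  # hypothetical QPP estimates
print(round(pearson(actual, predicted), 3))
```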