Abstract:
Recent studies show the growing significance of document retrieval in the generation of LLMs, i.e., RAG, within the scientific domain by bridging their knowledge gap; however, dense retrievers struggle with domain-specific retrieval and complex query-document relationships. To address these challenges, MixGR improves dense retrievers' awareness of query-document matching across various levels of granularity using a zero-shot approach, fusing multiple metrics into a unified score for comprehensive query-document similarity. Our experiments demonstrate that MixGR outperforms previous retrieval methods by 22.6% and 10.4% on nDCG@5 for unsupervised and supervised retrievers, respectively, and enhances efficacy in two downstream scientific question-answering tasks, boosting LLM applications in the scientific domain.
Install the environment based on `requirements.txt`:

```bash
pip install -r requirements.txt
```
In our paper, we conduct experiments on four scientific datasets, i.e., NFCorpus, SciDocs, SciFact, and SciQ.
In the folder `data/`, we include:
- queries comprising multiple subqueries, along with their decomposed subqueries;
- the documents and their decomposed propositions;
- qrels, recording the gold query-document mappings.
For the procedure of subquery and proposition generation, please refer to the instructions for Query and Document Decomposition in the following section.
Regarding query and document decomposition, we apply the off-the-shelf tool propositioner, which is hosted on Hugging Face. For the environment required to run propositioner, please refer to the original repository.
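As a rough illustration, proposition generation with a seq2seq propositioner might look like the following; the checkpoint name and prompt format are assumptions here, so please follow the original repository for the exact usage:

```python
# A minimal sketch of proposition generation, assuming a seq2seq propositioner
# on Hugging Face; the checkpoint name and prompt format are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "chentong00/propositionizer-wiki-flan-t5-large"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

passage = "Aspirin inhibits COX enzymes, which reduces prostaglandin synthesis."
inputs = tokenizer(passage, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256)
# The decoded output is expected to be a list of atomic propositions.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```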
Before starting, we suggest setting the repository root as the environment variable `$ROOT_DIR`. Additionally, corpus indexing and query searching depend on `pyserini`.
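For example, from the repository root:

```bash
export ROOT_DIR=$(pwd)
```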
For query and document decomposition, please refer to `script/decomposition/query.sh` and `script/decomposition/document.sh`, respectively.
The tool itself is not perfect; for potential errors, please refer to the error analysis in the appendix of the paper.
Both documents and propositions are indexed through the default framework, `pyserini`.
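As a rough sketch, dense encoding of a corpus with `pyserini` may look like the following; the corpus path and encoder checkpoint are illustrative assumptions, not the repository's exact settings:

```bash
# A sketch of dense corpus encoding with pyserini; the corpus path and
# encoder checkpoint are assumptions for illustration only.
python -m pyserini.encode \
  input   --corpus $ROOT_DIR/data/scifact/propositions.jsonl \
          --fields text \
  output  --embeddings $ROOT_DIR/indexes/scifact-propositions \
          --to-faiss \
  encoder --encoder facebook/contriever-msmarco \
          --fields text \
          --batch 32
```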
Additionally, we need to generate a BM25 index for both documents and propositions. This avoids keeping the corpus in memory; the raw text can instead be fetched from disk.
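A sketch with `pyserini`, where the paths are illustrative; the `--storeRaw` flag is what stores the raw text in the index so it can later be fetched from disk:

```bash
# A sketch of BM25 (Lucene) indexing with pyserini; paths are illustrative.
python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input $ROOT_DIR/data/scifact/documents \
  --index $ROOT_DIR/indexes/scifact-documents-bm25 \
  --generator DefaultLuceneDocumentGenerator \
  --threads 4 \
  --storeRaw
```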
Given queries/subqueries and documents/propositions, we compute scores for the four (quadrant) combinations between the different granularities of queries and documents:
|          | Document | Proposition |
|----------|----------|-------------|
| Query    |          |             |
| Subquery |          |             |
As mentioned in the paper, we retrieve the top-$k$ results via `scripts/rrf/union.sh`.
Lastly, we merge the retrieval results from the different query and document granularities through reciprocal rank fusion (RRF).
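For intuition, a minimal sketch of RRF; the doc IDs are toy placeholders, and the constant `k = 60` is the common default rather than necessarily the paper's setting:

```python
# A minimal sketch of reciprocal rank fusion (RRF); doc IDs are placeholders.
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists of doc IDs into a single ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # higher rank -> larger contribution
    return sorted(scores, key=scores.get, reverse=True)

# e.g., fusing the four granularity combinations (query/subquery x document/proposition)
print(rrf([["d1", "d2"], ["d2", "d3"], ["d1", "d3"], ["d2", "d1"]]))
```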
The results presented in the paper are covered by `result_collection.ipynb`, where the evaluation is based on the package `beir`.
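As a rough illustration of the `beir` evaluation call, with toy qrels and retrieval results as placeholders:

```python
# A minimal sketch of nDCG evaluation with beir; qrels/results are toy data.
from beir.retrieval.evaluation import EvaluateRetrieval

qrels = {"q1": {"d1": 1}}                 # gold relevance judgments
results = {"q1": {"d1": 0.9, "d2": 0.3}}  # retrieval scores per query
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [5])
print(ndcg)  # e.g., {'NDCG@5': 1.0}
```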
This repository contains experimental software intended to provide supplementary details for the corresponding publication. If you encounter any issues, please contact Fengyu Cai.
The software in this repository is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.
```bibtex
@misc{cai2024textttmixgr,
      title={$\texttt{MixGR}$: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity},
      author={Fengyu Cai and Xinran Zhao and Tong Chen and Sihao Chen and Hongming Zhang and Iryna Gurevych and Heinz Koeppl},
      year={2024},
      eprint={2407.10691},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```