utility_judgments

GTI benchmark

Download datasets with ground-truth evidence

Download the NQ test data from the test data of NQ, the HotpotQA dev data from KILT, and the MS MARCO dev data from msmarco and msm-qa.

Dense retrieval

We directly use RocketQAv2 on the wiki-based NQ and HotpotQA datasets and ADORE on the web-based MS MARCO dataset.

Counterfactual passages (CP)

We use the entity substitution and generation method.
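As a rough illustration of the entity-substitution half of this step, the sketch below swaps the gold answer string in a passage for a different entity, so the passage stays fluent but supports a wrong answer. The function name `substitute_entity` and the example strings are hypothetical and are not taken from this repository; the actual pipeline, including the generation step, may work differently.

```python
import re


def substitute_entity(passage: str, answer: str, replacement: str) -> str:
    """Replace whole-word, case-insensitive occurrences of `answer`.

    Hypothetical helper for illustration only; not part of this repository.
    """
    # Lookarounds keep "Paris" from matching inside "Parisian".
    pattern = re.compile(r"(?<!\w)" + re.escape(answer) + r"(?!\w)", re.IGNORECASE)
    # A callable replacement avoids backslash escapes being interpreted.
    return pattern.sub(lambda _match: replacement, passage)


passage = "The Eiffel Tower is located in Paris. Paris is the capital of France."
counterfactual = substitute_entity(passage, "Paris", "Rome")
print(counterfactual)
```

Plain string substitution like this only covers exact surface matches of the answer; aliases and paraphrases would need extra handling.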

Highly relevant noisy passages (HRNP) and Weakly relevant noisy passages (WRNP)

From the results of existing retrievers, keep only the passages that do not contain the answer; these passages serve as the noisy passages.
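The selection described above can be sketched as a simple answer-containment check. This is a minimal sketch under assumptions: the helper names (`normalize`, `contains_answer`, `select_noisy_passages`) and the normalized substring-matching rule are illustrative choices, not the repository's actual code, and the split into highly and weakly relevant passages is not covered here.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def contains_answer(passage: str, answers: list[str]) -> bool:
    """Return True if any normalized answer is a substring of the passage."""
    norm_passage = normalize(passage)
    norm_answers = [normalize(a) for a in answers]
    # Skip answers that normalize to the empty string.
    return any(a in norm_passage for a in norm_answers if a)


def select_noisy_passages(retrieved: list[str], answers: list[str]) -> list[str]:
    """Keep retrieved passages that do NOT contain any answer (assumed rule)."""
    return [p for p in retrieved if not contains_answer(p, answers)]


retrieved = [
    "Paris is the capital of France.",
    "France borders Spain and Italy.",
    "Lyon is a large French city.",
]
noisy = select_noisy_passages(retrieved, ["Paris"])
print(noisy)
```

Because the input order is preserved, the retriever's ranking of the remaining passages is still available for any later step that needs it.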

Candidate passages construction

We have also provided the final GTI benchmark, which you can download from link

GTU benchmark

We have also provided the final GTU benchmark, which you can download from link

Utility judgments of LLMs

Using Llama 2-13B as an example, we demonstrate four utility judgment methods: pointwise, pairwise, listwise-set, and listwise-rank. To test another model, replace the model used in the scripts.

python llama2-point.py
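The four methods differ mainly in how many passages the model sees per prompt and what it is asked to return. The sketch below only builds prompt strings and does not call any model. The function names and the prompt wording are hypothetical, written for illustration; the prompts actually used by the scripts in this repository may differ.

```python
def pointwise_prompt(question: str, passage: str) -> str:
    """One passage per prompt; the model gives a yes/no judgment."""
    return (
        f"Question: {question}\n"
        f"Passage: {passage}\n"
        "Is this passage useful for answering the question? Answer Yes or No."
    )


def pairwise_prompt(question: str, passage_a: str, passage_b: str) -> str:
    """Two passages per prompt; the model picks the more useful one."""
    return (
        f"Question: {question}\n"
        f"Passage A: {passage_a}\n"
        f"Passage B: {passage_b}\n"
        "Which passage is more useful for answering the question? Answer A or B."
    )


def listwise_prompt(question: str, passages: list[str], mode: str = "set") -> str:
    """All passages in one prompt; `mode` selects set or rank output."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    if mode == "set":
        instruction = "List the numbers of all passages that are useful."
    elif mode == "rank":
        instruction = "Rank the passage numbers from most to least useful."
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"Question: {question}\n{numbered}\n{instruction}"


prompt = listwise_prompt("Who wrote Hamlet?", ["Passage one.", "Passage two."], mode="rank")
print(prompt)
```

Pointwise prompts scale linearly with the number of candidate passages and pairwise prompts quadratically, while the listwise variants need a single call but a longer context.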