In this repository we include the prompts, codes and generated responses for the paper "A Comparison of Methods for Evaluating Generative IR".
Prompts
includes all the prmopts we used for our experiments.
generated responses
directory includes all the regular generated responses as well as liar responses we used for our experiments by four different LLMS including:
- gpt-4
- gpt-3.5-turbo
- llama2 7b chat
- llama2 13b chat LLMs.
data
includes the queries and relevance judgements used for our experiments from Trec DL 2019 and Trec DL 2020.
genir experiments
include the results in evaluating LLM generated responses using 5 different evaluation methods described in the paper.
Validations
include the resuls of compaing qrels from different level of relevance in TREC dl 2019 and 2020.
The results for experiments conducted in Section 3 and 4 can be found under Validations
and Genir experiments
respectively.