Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems

--- Online Appendix ---

(5/21/2024: Fixed hyperlinks to directories, which did not work in the anonymized repository.)

TREC Data Sets used in Experimental Evaluation

  • dl19: TREC DL 2019: Uses 43 question-form queries from the Deep Learning track, harvested from search logs. The system's task is to retrieve passages from a web collection that answer the query. The official track received 35 system submissions; official metrics are NDCG@10, MAP, and MRR.
  • dl20: TREC DL 2020: Same setup as the previous Deep Learning track, but with 54 additional queries and 59 submitted systems.
  • car: TREC CAR Y3: Comprises 131 queries and 721 query subtopics from the TREC Complex Answer Retrieval track. These were harvested from titles and section headings of school textbooks provided in the TQA dataset [13]. The system's task is to retrieve Wikipedia passages to synthesize a per-query response that covers all query subtopics. Official track metrics are MAP, NDCG@20, and R-precision. 22 systems were submitted to this track, but several have identical rankings; we use the 16 distinguishable systems used by Sander et al.

Unabridged Results

Below are the results that were presented in the manuscript in abridged form.

Workbench Software

All experiments can be reproduced with the Autograder Workbench software.

Data for Reproduction

Please see the folder scripts for detailed bash scripts that reproduce the results in this paper for each dataset.

We provide the data produced by the different phases of the RUBRIC approach.

Each grade annotation approach is denoted by a prompt_class (see the sketch after the following list). The semantics are:

  • QuestionSelfRatedUnanswerablePromptWithChoices: (question) RUBRIC
  • NuggetSelfRatedPrompt: nugget RUBRIC
  • Thomas, Sun, Sun_few, HELM, FagB, FagB_few: direct grading prompts
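
As an illustration of how these labels can be used to slice the released grade annotations, below is a minimal Python sketch that filters records by prompt_class. It assumes the annotations are stored as gzip-compressed JSON lines with a prompt_class field on each record; both the file layout and the field name are assumptions made for illustration, not a specification of the released files.

    import gzip
    import json

    def filter_by_prompt_class(path, prompt_class):
        """Yield grade-annotation records whose prompt_class matches.

        Assumes a gzip-compressed JSON-lines file in which every record
        carries a 'prompt_class' field (an assumption, not a file spec).
        """
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record.get("prompt_class") == prompt_class:
                    yield record

    # Hypothetical file name; keep only question-RUBRIC annotations.
    rubric_grades = list(filter_by_prompt_class(
        "phase2-data/dl19-grades.jsonl.gz",
        "QuestionSelfRatedUnanswerablePromptWithChoices"))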

Preprocessing: Input Data

The data for TREC DL cannot be redistributed under its licensing terms.

Input data for CAR is provided in the linked folder.

Phase 1: Generated Grading Rubrics

Generated test questions and nuggets for query-specific rubrics are in the folder phase1-data/

Phase 2: RUBRIC Grading

Grade annotations for (question) RUBRIC, nugget RUBRIC, and all direct grading prompts are in the folder phase2-data/

Since the files are too large for GitHub, we compress them further. Please remove the xz compression but keep the gz compression.
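
As a minimal sketch of that step, the Python snippet below strips the outer xz layer and writes the still gz-compressed file back to disk; the file name is a placeholder, not an actual file in the repository. On the command line, xz -d <file> achieves the same.

    import lzma
    import shutil
    from pathlib import Path

    def strip_xz(path):
        """Decompress '<name>.gz.xz' to '<name>.gz', leaving the gz layer intact."""
        src = Path(path)
        dst = src.with_suffix("")  # drop only the trailing ".xz"
        with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)

    # Placeholder file name -- substitute the file you downloaded.
    strip_xz("phase2-data/dl19-grades.jsonl.gz.xz")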

Phase 3: RUBRIC-based Evaluation Metrics

We provide all generated trec_eval-compatible "qrels" files in the folder phase3-qrels/
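
Because these files use the standard four-column qrels format (query id, iteration, document id, grade), they can be fed to trec_eval directly or scored in Python. The sketch below shows one possible way to do the latter with the third-party pytrec_eval package; the file names are placeholders and pytrec_eval is not a dependency of this repository.

    from collections import defaultdict
    import pytrec_eval  # third-party: pip install pytrec_eval

    def load_qrels(path):
        """Parse a qrels file: 'query_id iteration doc_id grade' per line."""
        qrels = defaultdict(dict)
        with open(path, encoding="utf-8") as f:
            for line in f:
                query_id, _, doc_id, grade = line.split()
                qrels[query_id][doc_id] = int(grade)
        return dict(qrels)

    def load_run(path):
        """Parse a run file: 'query_id Q0 doc_id rank score tag' per line."""
        run = defaultdict(dict)
        with open(path, encoding="utf-8") as f:
            for line in f:
                query_id, _, doc_id, _, score, _ = line.split()
                run[query_id][doc_id] = float(score)
        return dict(run)

    # Placeholder file names -- substitute a generated qrels file and a system run.
    qrels = load_qrels("phase3-qrels/example.qrels")
    run = load_run("runs/example-system.run")
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "ndcg_cut"})
    per_query_scores = evaluator.evaluate(run)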

Results

Leaderboard correlation results are found in the folder results-leaderboard-correlation/

(NaN values indicate that no grades at this minimum grade level are available, for example with binary grading prompts, or with Thomas for levels larger than 2.)
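
For readers who want to recompute rank correlations of this kind, the sketch below uses scipy to compare two leaderboards held as {system: score} mappings. The in-memory representation and the toy numbers are purely illustrative and do not describe the files in results-leaderboard-correlation/.

    from scipy.stats import kendalltau, spearmanr

    def leaderboard_correlation(official, candidate):
        """Rank correlation between two leaderboards given as {system: score} dicts."""
        systems = sorted(set(official) & set(candidate))
        x = [official[s] for s in systems]
        y = [candidate[s] for s in systems]
        tau, _ = kendalltau(x, y)
        rho, _ = spearmanr(x, y)
        return tau, rho

    # Toy example with made-up scores, for illustration only.
    tau, rho = leaderboard_correlation(
        {"sysA": 0.71, "sysB": 0.64, "sysC": 0.58},
        {"sysA": 0.52, "sysB": 0.49, "sysC": 0.41},
    )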