Paper: ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
To implement ARES for scoring your RAG system and comparing it to other RAG configurations, you need three components:
- A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g., context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples, but several hundred examples are ideal.
- A set of few-shot examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system
- A much larger set of unlabeled query-document-answer triples output by your RAG system for scoring (see the sketch after this list)
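The unlabeled triples are supplied to ARES as a TSV file. The sketch below shows one way to assemble it from your own RAG pipeline with pandas; the column names (Query, Document, Answer) and the run_rag_system helper are illustrative assumptions, so match the headers to the examples in example_files before scoring.

import pandas as pd

def run_rag_system(query):
    # Placeholder for your RAG pipeline: replace with your retriever + generator.
    return ("retrieved passage text", "generated answer text")

# Collect (query, retrieved document, generated answer) triples from your RAG system.
queries = ["example query 1", "example query 2"]
rows = []
for query in queries:
    document, answer = run_rag_system(query)
    rows.append({"Query": query, "Document": document, "Answer": answer})

# Write the unlabeled triples as a tab-separated file for ARES to score.
pd.DataFrame(rows).to_csv("unlabeled_triples.tsv", sep="\t", index=False)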
The ARES training pipeline has three steps:
- Generate synthetic queries and answers from in-domain passages
- Prepare LLM judges for scoring your RAG system by fine-tuning on the synthetically generated training data
- Deploy the prepared LLM judges to evaluate your RAG system across key performance metrics
Note: We also allow users to skip Steps #1 and #2 by deploying a zero/few-shot LLM-as-a-Judge instead.
To install the necessary dependencies, run the following commands:
conda create -n llm_judge python=3.10
conda activate llm_judge
pip install -r requirements.txt
Additionally, you will need to initialize an OpenAI API key with the following command:
export OPENAI_API_KEY=<your key here>
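The scripts read the key from the environment, so a quick sanity check (just a sketch) before launching the pipeline can save a failed run:

import os

# Fail fast if the OpenAI key was not exported in this shell.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set; run the export command above."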
To generate synthetic training data, use LLM-as-a-Judge_Adaptation/Generate_Synthetic_Queries_and_Answers.py. Replace the items in the following command with your dataset and configuration:
python Generate_Synthetic_Queries_and_Answers.py \
--document_filepath <document_filepath> \
--few_shot_prompt_filename <few_shot_prompt_filename> \
--synthetic_queries_filename <synthetic_queries_filename> \
--documents_sampled 10000
This script will output a filepath to the generated synthetic queries for the next step.
Note: For example files for document_filepath and few_shot_prompt_filename, please see example_files.
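Before training a judge on the synthetic data, it can help to spot-check what was generated. A minimal sketch, assuming the output is a tab-separated file (the filename below is a placeholder; use the filepath printed by the script):

import pandas as pd

# Load the synthetic queries/answers written by Generate_Synthetic_Queries_and_Answers.py
# and print a few rows to verify they look reasonable for your domain.
synthetic = pd.read_csv("synthetic_queries.tsv", sep="\t")  # placeholder path
print(synthetic.shape)
print(synthetic.head())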
With the generated file under synthetic_queries_filename from the previous step, use LLM-as-a-Judge_Adaptation/General_Binary_Classifier.py to train your LLM-as-a-Judge with the following command:
python General_Binary_Classifier.py \
--classification_dataset <synthetic queries file> \
--test_set_selection <test_set_selection> \
--label_column Context_Relevance_Label \
--num_epochs 10 \
--patience_value 3 \
--learning_rate 5e-6
For classification_dataset, put the filepath of the synthetic queries generated in the previous step. For test_set_selection, put the filepath of the human-annotated examples of your dataset; it should be formatted like the file example_files/evaluation_datasets.tsv.
This script will output a model checkpoint path for the next step.
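ARES evaluates up to three criteria, and each criterion needs its own fine-tuned judge. One way to train all three is to re-run the command with a different --label_column each time, as in the sketch below; the Answer_Faithfulness_Label and Answer_Relevance_Label column names and the filepaths are assumptions, so match them to the columns in your annotated data.

import subprocess

# Train one LLM judge per evaluation criterion by re-running General_Binary_Classifier.py.
# The dataset paths are placeholders; the label names must match your annotated columns.
label_columns = [
    "Context_Relevance_Label",
    "Answer_Faithfulness_Label",
    "Answer_Relevance_Label",
]
for label in label_columns:
    subprocess.run(
        [
            "python", "General_Binary_Classifier.py",
            "--classification_dataset", "synthetic_queries.tsv",
            "--test_set_selection", "example_files/evaluation_datasets.tsv",
            "--label_column", label,
            "--num_epochs", "10",
            "--patience_value", "3",
            "--learning_rate", "5e-6",
        ],
        check=True,
    )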
With the model checkpoint output by Step #2, you can now score your RAG system's configurations using ARES with the following command in the folder RAG_Automatic_Evaluation/:
python LLMJudge_RAG_Compared_Scoring.py \
--alpha 0.05 \
--num_trials 1000 \
--evaluation_datasets <evaluation_datasets as list> \
--checkpoints <checkpoints as list> \
--labels <label columns as list> \
--GPT_scoring <True or False> \
--gold_label_path <gold_label_path> \
--swap_human_labels_for_gpt_labels False
For evaluation_datasets, we expect a list of filepaths to query-passage-answer TSVs for each RAG configuration you wish to score.
If you want to use few-shot GPT scoring, switch GPT_scoring to True. You can leave the checkpoints list blank and specify the GPT model with the tag --gpt_model <model selected>.
Note: For example files of evaluation_datasets and gold_label_path, please see example_files/evaluation_datasets.tsv for formatting.
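To compare several RAG configurations with the same set of judges, the list-valued flags need to line up: one evaluation TSV per configuration and one checkpoint per label column. The sketch below builds such an invocation with subprocess; every filepath and checkpoint name is a placeholder, and exactly how the script parses its list arguments should be checked against its argparse setup.

import subprocess

# Score two hypothetical RAG configurations with three fine-tuned judges.
evaluation_datasets = ["rag_config_A_triples.tsv", "rag_config_B_triples.tsv"]
checkpoints = [
    "checkpoints/context_relevance_judge.pt",
    "checkpoints/answer_faithfulness_judge.pt",
    "checkpoints/answer_relevance_judge.pt",
]
labels = [
    "Context_Relevance_Label",
    "Answer_Faithfulness_Label",
    "Answer_Relevance_Label",
]
subprocess.run(
    [
        "python", "LLMJudge_RAG_Compared_Scoring.py",
        "--alpha", "0.05",
        "--num_trials", "1000",
        "--evaluation_datasets", *evaluation_datasets,
        "--checkpoints", *checkpoints,
        "--labels", *labels,
        "--GPT_scoring", "False",
        "--gold_label_path", "example_files/evaluation_datasets.tsv",
        "--swap_human_labels_for_gpt_labels", "False",
    ],
    check=True,
)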
We include synthetic datasets for key experimental results in synthetic_datasets. The few-shot prompts used for generation and evaluation are included in datasets_v2. We also include instructions for fine-tuning LLM judges in the paper itself. Please reach out to jonsaadfalcon@stanford.edu if you have any further questions.
To cite our work, please use the following BibTeX: