This project applies the LLM-Eval framework ("LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models", Lin et al., 2023) to the PersonaChat dataset to assess response quality in a conversational context. Using GPT-4o-mini via the OpenAI API, the system generates scores (on a 0-5 or 0-100 scale) for four evaluation metrics: content, grammar, relevance, and appropriateness. The generated scores are then compared with human annotations, and the correlation is measured using Pearson, Spearman, and Kendall's Tau coefficients.
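For orientation, here is a minimal sketch of what a single LLM-Eval-style scoring request can look like with the OpenAI Python SDK. The prompt wording, the `rate_response` helper, and the JSON parsing are illustrative assumptions, not the code in `run_batch.py`:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_response(context: str, response: str, max_score: int = 5) -> dict:
    """Hypothetical helper: ask for all four scores in a single prompt."""
    prompt = (
        "Score the following response to the dialogue context on a scale "
        f"from 0 to {max_score} for content, grammar, relevance, and "
        "appropriateness. Reply with a JSON object only.\n\n"
        f"Context:\n{context}\n\nResponse:\n{response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Expected shape: {"content": 4, "grammar": 5, "relevance": 4, "appropriateness": 4}
    return json.loads(completion.choices[0].message.content)
```

In practice the project sends these requests in bulk through the OpenAI Batch API rather than one at a time. Before running anything, set up the environment: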
```bash
# Create a new environment
python3 -m venv .venv

# Activate the environment
source .venv/bin/activate

# Install the requirements
pip install -r requirements.txt
```
Copy the `.env.example` file to `.env` and fill in the necessary information.
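At a minimum this presumably means providing an OpenAI API key; the variable name below is an assumption, so defer to `.env.example` for the authoritative list:

```bash
# .env — assumed contents; check .env.example for the actual variable names
OPENAI_API_KEY=your-api-key-here
```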
To reproduce the results from the paper, run the following commands:

- Run the experiments

```bash
python3 run_batch.py -a 0 -b 5 -m gpt-4o-mini    # Replicates the results using scores from 0 to 5
# - or -
python3 run_batch.py -a 0 -b 100 -m gpt-4o-mini  # Replicates the results using scores from 0 to 100
```
You can specify any model available on the OpenAI API. A list of models can be found here.
These commands will create a batch file with the input data and send it to the OpenAI API. The output of the command will look like this:
```
Loading dataset from pc_usr_data.json
Stored requests in batchinput.jsonl

Input File Details
  Name               batchinput.jsonl
  ID                 file-[FILE_IDENTIFIER]
  Bytes              812411
  Status             processed

Batch Details
  ID                 batch_[BATCH_IDENTIFIER]
  Completion Window  24h
  Endpoint           /v1/chat/completions
  Input File ID      file-[FILE_IDENTIFIER]
  Status             validating
```
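The requests stored in `batchinput.jsonl` follow the OpenAI Batch API JSONL format. Below is a rough sketch of the submission flow, assuming standard Batch API usage rather than mirroring `run_batch.py` exactly:

```python
import json
from openai import OpenAI

client = OpenAI()

# In the real pipeline these prompts are built from pc_usr_data.json; a stand-in here.
prompts = ["Score the following response ..."]

# One JSONL line per request against /v1/chat/completions.
with open("batchinput.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the file, then create the batch job with a 24h completion window.
batch_file = client.files.create(file=open("batchinput.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # the batch_[BATCH_IDENTIFIER] used in the next step
```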
To retrieve the results, copy the `batch_[BATCH_IDENTIFIER]` and run the following command:

- Load the results

```bash
python3 load_batch.py batch_[BATCH_IDENTIFIER]
```
Note that the command output will show the batch progress if the batch is still processing. The batch identifier can also be found on OpenAI's Platform.
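Under the hood, retrieving a batch boils down to a couple of OpenAI SDK calls; a minimal sketch, not the actual `load_batch.py` implementation:

```python
import json
from openai import OpenAI

client = OpenAI()

batch = client.batches.retrieve("batch_[BATCH_IDENTIFIER]")
if batch.status != "completed":
    # Still processing: report progress instead of results.
    print(batch.status, batch.request_counts)
else:
    # Download the output file; it contains one JSON object per line.
    raw = client.files.content(batch.output_file_id).text
    results = [json.loads(line) for line in raw.splitlines() if line]
    print(f"Loaded {len(results)} scored responses")
```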
The output of `load_batch.py` will look like this:
```
===== Pearson =====
                        content   grammar  relevance  appropriateness
avg_understandable     0.215955  0.205903   0.243477         0.219548
avg_natural            0.441733  0.452959   0.418004         0.458911
avg_maintains_context  0.484232  0.087857   0.545199         0.554493
avg_engaging           0.338908  0.094160   0.318732         0.317295
avg_uses_knowledge     0.522484  0.103329   0.550135         0.502862

===== Kendall =====
                        content   grammar  relevance  appropriateness
avg_understandable     0.214378  0.244810   0.220963         0.225096
avg_natural            0.394293  0.345629   0.364810         0.406473
avg_maintains_context  0.398185  0.057896   0.427690         0.441805
avg_engaging           0.332145  0.063244   0.292944         0.290001
avg_uses_knowledge     0.463811  0.095284   0.475252         0.445239

===== Spearman =====
                        content   grammar  relevance  appropriateness
avg_understandable     0.240211  0.255376   0.248032         0.248370
avg_natural            0.458765  0.374023   0.431535         0.465314
avg_maintains_context  0.481184  0.065730   0.515119         0.520318
avg_engaging           0.393004  0.073037   0.348444         0.342798
avg_uses_knowledge     0.536816  0.102409   0.551348         0.503047
```
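For reference, a correlation matrix of this shape can be computed from the per-response scores with SciPy; the DataFrame names below (`llm_scores`, `human_scores`) are placeholders for however the repository actually stores its data:

```python
import pandas as pd
from scipy.stats import kendalltau, pearsonr, spearmanr

def correlation_table(llm_scores: pd.DataFrame, human_scores: pd.DataFrame, method) -> pd.DataFrame:
    """Correlate every human annotation column with every LLM metric column."""
    return pd.DataFrame(
        {
            llm_col: {
                human_col: method(llm_scores[llm_col], human_scores[human_col])[0]
                for human_col in human_scores.columns
            }
            for llm_col in llm_scores.columns
        }
    )

# Example usage with hypothetical frames aligned by response:
# print("===== Pearson =====");  print(correlation_table(llm_scores, human_scores, pearsonr))
# print("===== Kendall =====");  print(correlation_table(llm_scores, human_scores, kendalltau))
# print("===== Spearman ====="); print(correlation_table(llm_scores, human_scores, spearmanr))
```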