This repository contains the source code for the following paper:
A Comprehensive Assessment of Dialog Evaluation Metrics
We use conda to manage environments for different metrics.
Each directory in conda_envs
holds an environment specification. Please install all of them before starting the next step.
For example, to install conda_envs/eval_base, run
conda env create -f conda_envs/eval_base/environment.yml
Note that some packages cannot be installed this way.
If you find that any packages are missing, such as bleurt, nlg-eval, or packages downloaded by spaCy, please install them following their official instructions.
We apologize for any inconvenience.
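Since every environment has to be created before the next step, the command above can be generated for each specification in one pass. The following is a minimal sketch, assuming every sub-directory of conda_envs contains a file named environment.yml, as conda_envs/eval_base does. The helper name conda_create_commands is hypothetical and not part of this repository. It only prints the commands so they can be reviewed before running.

```python
from pathlib import Path
from typing import List


def conda_create_commands(env_root: str = "conda_envs") -> List[str]:
    """Build one `conda env create` command per environment specification.

    Assumption: each sub-directory of `env_root` holds an `environment.yml`
    (stated for conda_envs/eval_base; assumed for the other directories).
    """
    specs = sorted(Path(env_root).glob("*/environment.yml"))
    return [f"conda env create -f {spec.as_posix()}" for spec in specs]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in conda_create_commands():
        print(command)
```

Printing rather than executing keeps the sketch safe to run; the printed lines can be pasted into a shell once they look right.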
The directory of each quality-annotated dataset is placed in data, together with a data_loader.py for parsing the data.
Please follow the instructions below to download each dataset, place it into the corresponding directory, and run data_loader.py directly to check that you are using the correct data.
Download human_rating_scores.txt
from https://www.dropbox.com/s/oh1trbos0tjzn7t/dstc6_t2_evaluation.tgz .
Download the data directory https://github.com/ictnlp/DialoFlow/tree/main/FlowScore/data and place it into data/dstc9_data.
Download https://github.com/PlusLabNLP/PredictiveEngagement/blob/master/data/Eng_Scores_queries_gen_gtruth_replies.csv and rename it to engage_all.csv.
Download http://shikib.com/fed_data.json .
Download each directory in https://github.com/li3cmz/GRADE/tree/main/evaluation/eval_data and place it as data/grade_data/[convai2|dailydialog|empatheticdialogues].
Also download the human_score.txt in https://github.com/li3cmz/GRADE/tree/main/evaluation/human_score into the corresponding data/grade_data/[convai2|dailydialog|empatheticdialogues].
Download context_data_release.csv
and fluency_data_release.csv
from https://github.com/alexzhou907/dialogue_evaluation .
Download TopicalChat and PersonaChat data from http://shikib.com/usr
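Once the datasets are in place, every data_loader.py can be run in one pass to check the downloads. This is a sketch under the assumption that each dataset directory under data contains its own data_loader.py, as described above. The helpers find_data_loaders and run_data_loader are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def find_data_loaders(data_root: str = "data") -> List[Path]:
    """Return every <data_root>/<dataset>/data_loader.py, sorted by path."""
    return sorted(Path(data_root).glob("*/data_loader.py"))


def run_data_loader(loader: Path) -> bool:
    """Run one loader from inside its own directory.

    Returns True if the script exits with status 0, False otherwise.
    """
    result = subprocess.run([sys.executable, loader.name], cwd=loader.parent)
    return result.returncode == 0


if __name__ == "__main__":
    for loader in find_data_loaders():
        status = "ok" if run_data_loader(loader) else "FAILED"
        print(f"{loader.parent.name}: {status}")
```

Each loader is started with its dataset directory as the working directory, which mirrors running data_loader.py directly in that directory.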
For baselines, we use nlg-eval. Please follow its instructions to install it.
For each dialog metric, please follow the instructions in the README in the corresponding directory.
PredictiveEngage, BERT-RUBER, and PONE require a running bert-as-service.
If you want to evaluate them, please install and run bert-as-service following its official instructions.
We also provide the script we used to run bert-as-service, run_bert_as_service.sh; feel free to use it.
We used a web server for running USR and FED in our experiments.
Please modify the paths in usr_fed/usr/usr_server.py and usr_fed/fed/fed_server.py before starting the servers, and modify the path in usr_fed_metric.py.
1. After you download all datasets, run gen_data.py to transform all datasets into the input format for all metrics. If you only want to evaluate a single metric (metric) on a single dataset (dataset), run gen_data.py --source_data dataset --target_format metric.
2. Modify the path in run_eval.sh as specified in the script, since the Conda environments need to be activated when running the script. Run eval_metrics.sh to evaluate all quality-annotated data.
3. Some metrics write their output in their own format. Therefore, run read_result.py to read the results of those metrics and transform them into outputs. As in step 1, you can specify the metric and data with read_result.py --metric metric --eval_data dataset.
4. outputs/METRIC/DATA/results.json holds the prediction scores of each metric (METRIC) on each quality-annotated dataset (DATA), while running data_loader.py directly in each data directory also generates the corresponding human scores. You can perform any analysis with the data (the Jupyter notebook used in our analysis will be released).
For example, outputs/grade/dstc9_data/results.json could be:
'GRADE': # the metric name
[
0.2568123, # the score of the first sample
0.1552132,
...
0.7812346
]
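Given such a file and the matching human scores, correlations like those in the tables below can be approximated in a few lines. This is a sketch, not the analysis code used for the paper. It assumes that results.json is a JSON object mapping the metric name to a list of scores and that the human scores are listed in the same order. Pearson and Spearman correlation are implemented in pure Python to keep the example dependency-free, and all function names are hypothetical.

```python
import json
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all values tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def load_metric_scores(path: str, metric: str) -> List[float]:
    """Read the score list of `metric` from a results.json file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        results: Dict[str, List[float]] = json.load(f)
    return results[metric]
```

This sketch does not compute p-values. If SciPy is available, scipy.stats.pearsonr and scipy.stats.spearmanr return both the coefficient and a p-value.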
All values are statistically significant at p < 0.05 unless marked with *. P and S denote the Pearson and Spearman correlations with human scores.
Dataset | USR-TopicalChat | | | | USR-PersonaChat | | | |
Level | Turn-Level | | System-Level | | Turn-Level | | System-Level | |
Metric | P | S | P | S | P | S | P | S |
BLEU-4 | 0.216 | 0.296 | 0.874* | 0.900 | 0.135 | 0.090* | 0.841* | 0.800* |
METEOR | 0.336 | 0.391 | 0.943 | 0.900 | 0.253 | 0.271 | 0.907* | 0.800* |
ROUGE-L | 0.275 | 0.287 | 0.814* | 0.900 | 0.066* | 0.038* | 0.171* | 0.000* |
ADEM | -0.060* | -0.061* | 0.202* | 0.700* | -0.141 | -0.085* | 0.523* | 0.400* |
BERTScore | 0.298 | 0.325 | 0.854* | 0.900 | 0.152 | 0.122* | 0.241* | 0.000* |
BLEURT | 0.216 | 0.261 | 0.630* | 0.900 | 0.065* | 0.054* | -0.125* | 0.000* |
QuestEval | 0.300 | 0.338 | 0.943 | 1.000 | 0.176 | 0.236 | 0.885* | 1.000 |
RUBER | 0.247 | 0.259 | 0.876* | 1.000 | 0.131 | 0.190 | 0.997 | 1.000 |
BERT-RUBER | 0.342 | 0.348 | 0.992 | 0.900 | 0.266 | 0.248 | 0.958 | 0.200* |
PONE | 0.271 | 0.274 | 0.893 | 0.500* | 0.373 | 0.375 | 0.979 | 0.800* |
MAUDE | 0.044* | 0.083* | 0.317* | -0.200* | 0.345 | 0.298 | 0.440* | 0.400* |
DEB | 0.180 | 0.116 | 0.818* | 0.400* | 0.291 | 0.373 | 0.989 | 1.000 |
GRADE | 0.200 | 0.217 | 0.553* | 0.100* | 0.358 | 0.352 | 0.811* | 1.000 |
DynaEval | -0.032* | -0.022* | -0.248* | 0.100* | 0.149 | 0.171 | 0.584* | 0.800* |
USR | 0.412 | 0.423 | 0.967 | 0.900 | 0.440 | 0.418 | 0.864* | 1.000 |
USL-H | 0.322 | 0.340 | 0.966 | 0.900 | 0.495 | 0.523 | 0.969 | 0.800* |
DialogRPT | 0.120 | 0.105* | 0.944 | 0.600* | -0.064* | -0.083* | 0.347* | 0.800* |
Deep AM-FM | 0.285 | 0.268 | 0.969 | 0.700* | 0.228 | 0.219 | 0.965 | 1.000 |
HolisticEval | -0.147 | -0.123 | -0.919 | -0.200* | 0.087* | 0.113* | 0.051* | 0.000* |
PredictiveEngage | 0.222 | 0.310 | 0.870* | 0.900 | -0.003* | 0.033* | 0.683* | 0.200* |
FED | -0.124 | -0.135 | 0.730* | 0.100* | -0.028* | -0.000* | 0.005* | 0.400* |
FlowScore | 0.095* | 0.082* | -0.150* | 0.400* | 0.118* | 0.079* | 0.678* | 0.800* |
FBD | - | - | 0.916 | 0.100* | - | - | 0.644* | 0.800* |
Dataset | GRADE-ConvAI2 | | | | GRADE-DailyDialog | | | | GRADE-EmpatheticDialogue | | | |
Level | Turn-Level | | System-Level | | Turn-Level | | System-Level | | Turn-Level | | System-Level | |
Metric | P | S | P | S | P | S | P | S | P | S | P | S |
BLEU-4 | 0.003* | 0.128 | 0.034* | 0.000* | 0.075* | 0.184 | 1.000* | 1.000 | -0.051* | 0.002* | 1.000* | 1.000 |
METEOR | 0.145 | 0.181 | 0.781* | 0.600* | 0.096* | 0.010* | -1.000* | -1.000 | 0.118 | 0.055* | 1.000* | 1.000 |
ROUGE-L | 0.136 | 0.140 | 0.209* | 0.000* | 0.154 | 0.147 | 1.000* | 1.000 | 0.029* | -0.013* | 1.000* | 1.000 |
ADEM | -0.060* | -0.057* | -0.368* | -0.200* | 0.064* | 0.071* | 1.000* | 1.000 | -0.036* | -0.028* | 1.000* | 1.000 |
BERTScore | 0.225 | 0.224 | 0.918* | 0.800* | 0.129 | 0.100* | -1.000* | -1.000 | 0.046* | 0.033* | 1.000* | 1.000 |
BLEURT | 0.125 | 0.120 | -0.777* | -0.400* | 0.176 | 0.133 | 1.000* | 1.000 | 0.087* | 0.051* | 1.000* | 1.000 |
QuestEval | 0.279 | 0.319 | 0.283* | 0.400* | 0.020* | 0.006* | -1.000* | -1.000 | 0.201 | 0.272 | 1.000* | 1.000 |
RUBER | -0.027* | -0.042* | -0.458* | -0.400* | -0.084* | -0.094* | -1.000* | -1.000 | -0.078* | -0.039* | 1.000* | 1.000 |
BERT-RUBER | 0.309 | 0.314 | 0.885* | 1.000 | 0.134 | 0.128 | -1.000* | -1.000 | 0.163 | 0.148 | 1.000* | 1.000 |
PONE | 0.362 | 0.373 | 0.816* | 0.800* | 0.163 | 0.163 | -1.000* | -1.000 | 0.177 | 0.161 | 1.000* | 1.000 |
MAUDE | 0.351 | 0.304 | 0.748* | 0.800* | -0.036* | -0.073* | 1.000* | 1.000 | 0.007* | -0.057* | 1.000* | 1.000 |
DEB | 0.426 | 0.504 | 0.995 | 1.000 | 0.337 | 0.363 | 1.000* | 1.000 | 0.356 | 0.395 | 1.000* | 1.000 |
GRADE | 0.566 | 0.571 | 0.883* | 0.800* | 0.278 | 0.253 | -1.000* | -1.000 | 0.330 | 0.297 | 1.000* | 1.000 |
DynaEval | 0.138 | 0.131 | -0.996 | -1.000 | 0.108* | 0.120 | -1.000* | -1.000 | 0.146 | 0.141 | -1.000* | -1.000 |
USR | 0.501 | 0.500 | 0.995 | 1.000 | 0.057* | 0.057* | -1.000* | -1.000 | 0.264 | 0.255 | 1.000* | 1.000 |
USL-H | 0.443 | 0.457 | 0.971 | 1.000 | 0.108* | 0.093* | -1.000* | -1.000 | 0.293 | 0.235 | 1.000* | 1.000 |
DialogRPT | 0.137 | 0.158 | -0.311* | -0.600* | -0.000* | 0.037* | -1.000* | -1.000 | 0.211 | 0.203 | 1.000* | 1.000 |
Deep AM-FM | 0.117 | 0.130 | 0.774* | 0.400* | 0.026* | 0.022* | 1.000* | 1.000 | 0.083* | 0.058* | 1.000* | 1.000 |
HolisticEval | -0.030* | -0.010* | -0.297* | -0.400* | 0.025* | 0.020* | 1.000* | 1.000 | 0.199 | 0.204 | -1.000* | -1.000 |
PredictiveEngage | 0.154 | 0.164 | 0.601* | 0.600* | -0.133 | -0.135 | -1.000* | -1.000 | -0.032* | -0.078* | 1.000* | 1.000 |
FED | -0.090 | -0.072* | -0.254* | 0.000* | 0.080* | 0.064* | 1.000* | 1.000 | -0.014* | -0.044* | 1.000* | 1.000 |
FlowScore | - | - | - | - | - | - | - | - | - | - | - | - |
FBD | - | - | -0.235* | -0.400* | - | - | -1.000* | -1.000 | - | - | -1.000* | -1.000 |
Dataset | DSTC6 | | | |
Level | Turn-Level | | System-Level | |
Metric | P | S | P | S |
BLEU-4 | 0.131 | 0.298 | -0.064* | 0.050* |
METEOR | 0.307 | 0.323 | 0.633 | 0.084* |
ROUGE-L | 0.332 | 0.326 | 0.487 | 0.215* |
ADEM | 0.151 | 0.118 | 0.042* | 0.347* |
BERTScore | 0.369 | 0.337 | 0.671 | 0.265* |
BLEURT | 0.326 | 0.294 | 0.213* | 0.426* |
QuestEval | 0.188 | 0.242 | -0.215* | 0.206* |
RUBER | 0.114 | 0.092 | -0.074* | 0.104* |
BERT-RUBER | 0.204 | 0.217 | 0.825 | 0.093* |
PONE | 0.208 | 0.200 | 0.608 | 0.235* |
MAUDE | 0.195 | 0.128 | 0.739 | 0.217* |
DEB | 0.211 | 0.214 | -0.261* | 0.492 |
GRADE | 0.119 | 0.122 | 0.784 | 0.611 |
DynaEval | 0.286 | 0.246 | 0.342* | -0.050* |
USR | 0.184 | 0.166 | 0.432* | 0.147* |
USL-H | 0.217 | 0.179 | 0.811 | 0.298* |
DialogRPT | 0.170 | 0.155 | 0.567 | 0.334* |
Deep AM-FM | 0.326 | 0.295 | 0.817 | 0.674 |
HolisticEval | 0.001* | -0.004* | 0.010 | -0.002 |
PredictiveEngage | 0.043 | 0.004* | -0.094* | -0.409* |
FED | -0.106 | -0.083 | 0.221* | 0.322* |
FlowScore | 0.064 | 0.095 | 0.352* | 0.362* |
FBD | - | - | -0.481 | -0.234* |
Dataset | PredictiveEngage-DailyDialog | |
Level | Turn-Level | |
Metric | P | S |
QuestEval | 0.296 | 0.341 |
MAUDE | 0.104 | 0.060* |
DEB | 0.516 | 0.580 |
GRADE | 0.600 | 0.622 |
DynaEval | 0.167 | 0.160 |
USR | 0.582 | 0.640 |
USL-H | 0.688 | 0.699 |
DialogRPT | 0.489 | 0.533 |
HolisticEval | 0.368 | 0.365 |
PredictiveEngage | 0.429 | 0.414 |
FED | 0.164 | 0.159 |
FlowScore | - | - |
FBD | - | - |
Dataset | HolisticEval-DailyDialog | |
Level | Turn-Level | |
Metric | P | S |
QuestEval | 0.285 | 0.260 |
MAUDE | 0.275 | 0.364 |
DEB | 0.584 | 0.663 |
GRADE | 0.678 | 0.697 |
DynaEval | -0.023* | -0.009* |
USR | 0.589 | 0.645 |
USL-H | 0.486 | 0.537 |
DialogRPT | 0.283 | 0.332 |
HolisticEval | 0.670 | 0.764 |
PredictiveEngage | -0.033* | 0.060* |
FED | 0.485 | 0.507 |
FlowScore | - | - |
FBD | - | - |
Dataset | FED | | | |
Level | Turn-Level | | Dialog-Level | |
Metric | P | S | P | S |
QuestEval | 0.037* | 0.093* | -0.032* | 0.080* |
MAUDE | 0.018* | -0.094* | -0.047* | -0.280 |
DEB | 0.230 | 0.187 | -0.130* | 0.006* |
GRADE | 0.134 | 0.118 | -0.034* | -0.065* |
DynaEval | 0.319 | 0.323 | 0.503 | 0.547 |
USR | 0.114 | 0.117 | 0.093* | 0.062* |
USL-H | 0.201 | 0.189 | 0.073* | 0.152* |
DialogRPT | -0.118 | -0.086* | -0.221 | -0.214 |
HolisticEval | 0.122 | 0.125 | -0.276 | -0.304 |
PredictiveEngage | 0.024* | 0.094* | 0.026* | 0.155* |
FED | 0.120 | 0.095 | 0.222 | 0.320 |
FlowScore | -0.065* | -0.055* | -0.073* | -0.003* |
FBD | - | - | - | - |
Dataset | DSTC9 | | | |
Level | Dialog-Level | | System-Level | |
Metric | P | S | P | S |
QuestEval | 0.026* | 0.043 | 0.604 | 0.527* |
MAUDE | 0.059 | 0.042* | 0.224* | 0.045* |
DEB | 0.085 | 0.131 | 0.683 | 0.473* |
GRADE | -0.078 | -0.070 | -0.674 | -0.482* |
DynaEval | 0.093 | 0.101 | 0.652 | 0.727 |
USR | 0.019* | 0.020* | 0.149* | 0.127* |
USL-H | 0.105 | 0.105 | 0.566* | 0.755 |
DialogRPT | 0.076 | 0.069 | 0.685 | 0.555* |
HolisticEval | 0.015* | 0.002* | -0.019* | -0.100* |
PredictiveEngage | 0.114 | 0.115 | 0.809 | 0.664 |
FED | 0.128 | 0.120 | 0.559* | 0.391* |
FlowScore | 0.147 | 0.140 | 0.907 | 0.900 |
FBD | - | - | -0.669 | -0.627 |
Let the name of the new dataset be sample.
Create a directory data/sample_data and write a function load_sample_data as follows:
from typing import Dict, List


def load_sample_data(base_dir: str) -> Dict:
    '''
    Args:
        base_dir: the absolute path to data/sample_data
    Return:
        Dict:
        {
            # the required items
            'contexts': List[List[str]],  # dialog contexts, each split by turns, so one dialog context is a List[str]
            'responses': List[str],  # dialog responses
            'references': List[str],  # dialog references; if the data has no reference, still give a dummy one such as "NO REF"
            'scores': List[float],  # human scores
            # add any customized items
            'Customized Item': List[str]  # any additional info in the data
        }
    '''
Import the function in gen_data.py and run python gen_data.py --source_data sample.
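As an illustration, a loader for a hypothetical annotation file might look like the following. The file name sample.json and its field names (context, response, reference, score) are assumptions made for this sketch. Only the keys of the returned dictionary come from the specification above.

```python
import json
import os
from typing import Dict, List


def load_sample_data(base_dir: str) -> Dict[str, List]:
    """Load a hypothetical <base_dir>/sample.json into the required format.

    Assumed file layout (not defined by this repository): a JSON list of
    records such as
        {"context": ["Hi!", "Hello."], "response": "How are you?",
         "reference": "How is it going?", "score": 4.0}
    """
    with open(os.path.join(base_dir, "sample.json"), encoding="utf-8") as f:
        records = json.load(f)

    data: Dict[str, List] = {
        "contexts": [], "responses": [], "references": [], "scores": []
    }
    for record in records:
        # One dialog context is a list of turns (List[str]).
        data["contexts"].append(list(record["context"]))
        data["responses"].append(record["response"])
        # Fall back to a dummy reference when the record has none.
        data["references"].append(record.get("reference") or "NO REF")
        data["scores"].append(float(record["score"]))
    return data
```

All four required lists have the same length, one entry per annotated response, so downstream code can iterate over them in parallel.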
Let the name of the new metric be metric.
Write a function gen_metric_data to transform the data and write it into the metric directory:
from typing import Dict


# input format 1
def gen_metric_data(data: Dict, output_path: str):
    '''
    Args:
        data: the return value of a load_data function, e.g. {'contexts': ...}
        output_path: path to the output file
    '''


# input format 2
def gen_metric_data(data: Dict, base_dir: str, dataset: str):
    '''
    Args:
        data: the return value of a load_data function, e.g. {'contexts': ...}
        base_dir: path to the output directory
        dataset: name of the dataset
    '''
There are two input formats; use whichever is easier for you.
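A sketch of input format 1 for a hypothetical metric that reads one JSON object per line is shown here. The output layout (the keys context, response, and reference) is an assumption for illustration; a real metric dictates its own input format.

```python
import json
from typing import Dict


def gen_metric_data(data: Dict, output_path: str) -> None:
    """Write one JSON object per sample to `output_path` (assumed layout)."""
    with open(output_path, "w", encoding="utf-8") as f:
        for context, response, reference in zip(
            data["contexts"], data["responses"], data["references"]
        ):
            example = {
                "context": context,      # List[str], one entry per turn
                "response": response,    # the response to be scored
                "reference": reference,  # may be a dummy such as "NO REF"
            }
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Writing the samples in their original order matters: the scores read back later are matched to the human scores by position.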
Import the function in gen_data.py
and follow comments in the code to add the metric.
Then write a function read_metric_result
to read the prediction of the metric:
from typing import Dict, List, Union


def read_metric_result(data_path: str) -> Union[List, Dict]:
    '''
    Args:
        data_path: path to the prediction file
    Return:
        # You can choose to return a list or a dict
        List: metric scores, e.g. [0.2, 0.3, 0.4, ...]
        or
        Dict: {'metric': List}  # metric scores
    '''
Import the function in read_result.py
and follow comments in the code to add the metric.
Then just follow the previous evaluation instructions to evaluate the metric.
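As a sketch of read_metric_result, suppose a hypothetical metric writes one score per line to its prediction file. That file layout is an assumption for illustration; adapt the parsing to whatever the metric actually produces.

```python
from typing import List


def read_metric_result(data_path: str) -> List[float]:
    """Read one float score per line from `data_path` (assumed layout)."""
    scores: List[float] = []
    with open(data_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                scores.append(float(line))
    return scores
```

The returned list keeps the order of the prediction file, which is expected to match the order of the samples written by gen_metric_data.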