A Comprehensive Assessment of Dialog Evaluation Metrics

This repository contains the source code for the following paper:

A Comprehensive Assessment of Dialog Evaluation Metrics

Prerequisites

We use conda to manage the environments for the different metrics.

Each directory in conda_envs holds an environment specification. Please install all of them before starting the next step.

For example, to install conda_envs/eval_base, run

conda env create -f conda_envs/eval_base/environment.yml

Note that some packages cannot be installed via this method.

If you find that any packages are missing, such as bleurt, nlg-eval, or packages downloaded by spaCy, please install them following their official instructions.

We apologize for any inconvenience.
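To create every environment in one pass, the conda env create command can be repeated for each directory under conda_envs. The following is a minimal sketch, assuming each directory contains a file named environment.yml as in the eval_base example; the helper names build_env_commands and install_all are hypothetical and not part of this repository.

```python
import subprocess
from pathlib import Path
from typing import List


def build_env_commands(conda_envs_dir: str) -> List[List[str]]:
    """Return one `conda env create` command per environment spec found."""
    commands = []
    # Assumption: every environment directory holds a file named environment.yml.
    for spec in sorted(Path(conda_envs_dir).glob("*/environment.yml")):
        commands.append(["conda", "env", "create", "-f", str(spec)])
    return commands


def install_all(conda_envs_dir: str = "conda_envs") -> None:
    """Run each command in turn and stop at the first failure."""
    for cmd in build_env_commands(conda_envs_dir):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling install_all() from the repository root would create the environments one after another. Packages that conda cannot install this way still need to be added by hand.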

Data Preparation

Each quality-annotated dataset has its own directory under data, together with a data_loader.py for parsing the data.

Please follow the instructions below to download each dataset and place it into the corresponding directory, then run data_loader.py directly to check that you are using the correct data.

DSTC6 Data

Download human_rating_scores.txt from https://www.dropbox.com/s/oh1trbos0tjzn7t/dstc6_t2_evaluation.tgz .

DSTC9 Data

Download the data directory https://github.com/ictnlp/DialoFlow/tree/main/FlowScore/data and place it into data/dstc9_data.

Engage Data

Download https://github.com/PlusLabNLP/PredictiveEngagement/blob/master/data/Eng_Scores_queries_gen_gtruth_replies.csv and rename it to engage_all.csv.

Fed Data

Download http://shikib.com/fed_data.json .

Grade Data

Download and place each directory in https://github.com/li3cmz/GRADE/tree/main/evaluation/eval_data as data/grade_data/[convai2|dailydialog|empatheticdialogues].

Also download the human_score.txt files from https://github.com/li3cmz/GRADE/tree/main/evaluation/human_score into the corresponding data/grade_data/[convai2|dailydialog|empatheticdialogues] directory.

Holistic Data

Download context_data_release.csv and fluency_data_release.csv from https://github.com/alexzhou907/dialogue_evaluation .

USR Data

Download TopicalChat and PersonaChat data from http://shikib.com/usr

Metric Installation

For baselines, we use nlg-eval. Please follow its instructions to install it.

For each dialog metric, please follow the instructions in the README in the corresponding directory.

Running Notes for Specific Metrics

bert-as-service

PredictiveEngage, BERT-RUBER and PONE require a running bert-as-service instance.

If you want to evaluate them, please install and run bert-as-service following the instructions here.

We also provide the script we used to run bert-as-service, run_bert_as_service.sh. Feel free to use it.

Running USR and FED

We used a web server for running USR and FED in our experiments.

Please modify the paths in usr_fed/usr/usr_server.py and usr_fed/fed/fed_server.py to start the servers, and modify the path in usr_fed_metric.py.

How to evaluate

  1. After you download all datasets, run gen_data.py to transform them into the input format of each metric. If you only want to evaluate one metric (metric) on one dataset (dataset), run gen_data.py --source_data dataset --target_format metric

  2. Modify the path in run_eval.sh as specified in the script, since the Conda environments need to be activated when the script runs. Then run eval_metrics.sh to evaluate all quality-annotated data.

  3. Some metrics write their output in their own format. Therefore, run read_result.py to read the results of those metrics and convert them into outputs. As in step 1, you can specify the metric and dataset with read_result.py --metric metric --eval_data dataset.

  4. outputs/METRIC/DATA/results.json holds the prediction scores of each metric (METRIC) on each quality-annotated dataset (DATA), while running data_loader.py directly in each data directory also generates the corresponding human scores. You can perform any analysis with these data (the Jupyter notebook used in our analysis will be released).

For example, outputs/grade/dstc9_data/results.json could be


    {
        'GRADE': # the metric name
        [
            0.2568123, # the score of the first sample
            0.1552132,
            ...
            0.7812346
        ]
    }
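As an illustration of the kind of analysis step 4 allows, the following sketch correlates metric predictions with human scores. It assumes that the real results.json parses as a JSON object mapping a metric name to a list of scores, and that the human scores are already available as a list of the same length. All function names are hypothetical. The correlations are written in plain Python so the example has no dependencies; scipy.stats.pearsonr and scipy.stats.spearmanr would serve the same purpose.

```python
import json
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def load_metric_scores(path: str) -> Dict[str, List[float]]:
    """Load results.json, assumed to be {"METRIC": [score, ...]}."""
    with open(path) as f:
        return json.load(f)
```

With metric scores loaded through load_metric_scores and human scores taken from the corresponding data loader, pearson(metric_scores, human_scores) and spearman(metric_scores, human_scores) give turn-level style correlations for that pair.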

Results

In the tables below, P and S denote Pearson and Spearman correlations. All values are statistically significant at p < 0.05 unless marked with *.
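The marking convention can be sketched as follows. This only illustrates the rule stated above; the function name and the three-decimal formatting are assumptions based on how the values in the tables are printed.

```python
def format_correlation(value: float, p_value: float, alpha: float = 0.05) -> str:
    """Format a correlation to three decimals; append '*' when it is not significant."""
    cell = f"{value:.3f}"
    if p_value >= alpha:
        # Not significant at the chosen level, so flag the entry.
        cell += "*"
    return cell
```

For instance, a correlation of 0.216 with p = 0.001 would be printed as 0.216, while -0.06 with p = 0.4 would be printed as -0.060*.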

USR Data

| Metric | USR-TopicalChat Turn-Level P | USR-TopicalChat Turn-Level S | USR-TopicalChat System-Level P | USR-TopicalChat System-Level S | USR-PersonaChat Turn-Level P | USR-PersonaChat Turn-Level S | USR-PersonaChat System-Level P | USR-PersonaChat System-Level S |
|---|---|---|---|---|---|---|---|---|
| BLEU-4 | 0.216 | 0.296 | 0.874* | 0.900 | 0.135 | 0.090* | 0.841* | 0.800* |
| METEOR | 0.336 | 0.391 | 0.943 | 0.900 | 0.253 | 0.271 | 0.907* | 0.800* |
| ROUGE-L | 0.275 | 0.287 | 0.814* | 0.900 | 0.066* | 0.038* | 0.171* | 0.000* |
| ADEM | -0.060* | -0.061* | 0.202* | 0.700* | -0.141 | -0.085* | 0.523* | 0.400* |
| BERTScore | 0.298 | 0.325 | 0.854* | 0.900 | 0.152 | 0.122* | 0.241* | 0.000* |
| BLEURT | 0.216 | 0.261 | 0.630* | 0.900 | 0.065* | 0.054* | -0.125* | 0.000* |
| QuestEval | 0.300 | 0.338 | 0.943 | 1.000 | 0.176 | 0.236 | 0.885* | 1.000 |
| RUBER | 0.247 | 0.259 | 0.876* | 1.000 | 0.131 | 0.190 | 0.997 | 1.000 |
| BERT-RUBER | 0.342 | 0.348 | 0.992 | 0.900 | 0.266 | 0.248 | 0.958 | 0.200* |
| PONE | 0.271 | 0.274 | 0.893 | 0.500* | 0.373 | 0.375 | 0.979 | 0.800* |
| MAUDE | 0.044* | 0.083* | 0.317* | -0.200* | 0.345 | 0.298 | 0.440* | 0.400* |
| DEB | 0.180 | 0.116 | 0.818* | 0.400* | 0.291 | 0.373 | 0.989 | 1.000 |
| GRADE | 0.200 | 0.217 | 0.553* | 0.100* | 0.358 | 0.352 | 0.811* | 1.000 |
| DynaEval | -0.032* | -0.022* | -0.248* | 0.100* | 0.149 | 0.171 | 0.584* | 0.800* |
| USR | 0.412 | 0.423 | 0.967 | 0.900 | 0.440 | 0.418 | 0.864* | 1.000 |
| USL-H | 0.322 | 0.340 | 0.966 | 0.900 | 0.495 | 0.523 | 0.969 | 0.800* |
| DialogRPT | 0.120 | 0.105* | 0.944 | 0.600* | -0.064* | -0.083* | 0.347* | 0.800* |
| Deep AM-FM | 0.285 | 0.268 | 0.969 | 0.700* | 0.228 | 0.219 | 0.965 | 1.000 |
| HolisticEval | -0.147 | -0.123 | -0.919 | -0.200* | 0.087* | 0.113* | 0.051* | 0.000* |
| PredictiveEngage | 0.222 | 0.310 | 0.870* | 0.900 | -0.003* | 0.033* | 0.683* | 0.200* |
| FED | -0.124 | -0.135 | 0.730* | 0.100* | -0.028* | -0.000* | 0.005* | 0.400* |
| FlowScore | 0.095* | 0.082* | -0.150* | 0.400* | 0.118* | 0.079* | 0.678* | 0.800* |
| FBD | - | - | 0.916 | 0.100* | - | - | 0.644* | 0.800* |

GRADE Data

| Metric | GRADE-ConvAI2 Turn-Level P | GRADE-ConvAI2 Turn-Level S | GRADE-ConvAI2 System-Level P | GRADE-ConvAI2 System-Level S | GRADE-DailyDialog Turn-Level P | GRADE-DailyDialog Turn-Level S | GRADE-DailyDialog System-Level P | GRADE-DailyDialog System-Level S | GRADE-EmpatheticDialogue Turn-Level P | GRADE-EmpatheticDialogue Turn-Level S | GRADE-EmpatheticDialogue System-Level P | GRADE-EmpatheticDialogue System-Level S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLEU-4 | 0.003* | 0.128 | 0.034* | 0.000* | 0.075* | 0.184 | 1.000* | 1.000 | -0.051* | 0.002* | 1.000* | 1.000 |
| METEOR | 0.145 | 0.181 | 0.781* | 0.600* | 0.096* | 0.010* | -1.000* | -1.000 | 0.118 | 0.055* | 1.000* | 1.000 |
| ROUGE-L | 0.136 | 0.140 | 0.209* | 0.000* | 0.154 | 0.147 | 1.000* | 1.000 | 0.029* | -0.013* | 1.000* | 1.000 |
| ADEM | -0.060* | -0.057* | -0.368* | -0.200* | 0.064* | 0.071* | 1.000* | 1.000 | -0.036* | -0.028* | 1.000* | 1.000 |
| BERTScore | 0.225 | 0.224 | 0.918* | 0.800* | 0.129 | 0.100* | -1.000* | -1.000 | 0.046* | 0.033* | 1.000* | 1.000 |
| BLEURT | 0.125 | 0.120 | -0.777* | -0.400* | 0.176 | 0.133 | 1.000* | 1.000 | 0.087* | 0.051* | 1.000* | 1.000 |
| QuestEval | 0.279 | 0.319 | 0.283* | 0.400* | 0.020* | 0.006* | -1.000* | -1.000 | 0.201 | 0.272 | 1.000* | 1.000 |
| RUBER | -0.027* | -0.042* | -0.458* | -0.400* | -0.084* | -0.094* | -1.000* | -1.000 | -0.078* | -0.039* | 1.000* | 1.000 |
| BERT-RUBER | 0.309 | 0.314 | 0.885* | 1.000 | 0.134 | 0.128 | -1.000* | -1.000 | 0.163 | 0.148 | 1.000* | 1.000 |
| PONE | 0.362 | 0.373 | 0.816* | 0.800* | 0.163 | 0.163 | -1.000* | -1.000 | 0.177 | 0.161 | 1.000* | 1.000 |
| MAUDE | 0.351 | 0.304 | 0.748* | 0.800* | -0.036* | -0.073* | 1.000* | 1.000 | 0.007* | -0.057* | 1.000* | 1.000 |
| DEB | 0.426 | 0.504 | 0.995 | 1.000 | 0.337 | 0.363 | 1.000* | 1.000 | 0.356 | 0.395 | 1.000* | 1.000 |
| GRADE | 0.566 | 0.571 | 0.883* | 0.800* | 0.278 | 0.253 | -1.000* | -1.000 | 0.330 | 0.297 | 1.000* | 1.000 |
| DynaEval | 0.138 | 0.131 | -0.996 | -1.000 | 0.108* | 0.120 | -1.000* | -1.000 | 0.146 | 0.141 | -1.000* | -1.000 |
| USR | 0.501 | 0.500 | 0.995 | 1.000 | 0.057* | 0.057* | -1.000* | -1.000 | 0.264 | 0.255 | 1.000* | 1.000 |
| USL-H | 0.443 | 0.457 | 0.971 | 1.000 | 0.108* | 0.093* | -1.000* | -1.000 | 0.293 | 0.235 | 1.000* | 1.000 |
| DialogRPT | 0.137 | 0.158 | -0.311* | -0.600* | -0.000* | 0.037* | -1.000* | -1.000 | 0.211 | 0.203 | 1.000* | 1.000 |
| Deep AM-FM | 0.117 | 0.130 | 0.774* | 0.400* | 0.026* | 0.022* | 1.000* | 1.000 | 0.083* | 0.058* | 1.000* | 1.000 |
| HolisticEval | -0.030* | -0.010* | -0.297* | -0.400* | 0.025* | 0.020* | 1.000* | 1.000 | 0.199 | 0.204 | -1.000* | -1.000 |
| PredictiveEngage | 0.154 | 0.164 | 0.601* | 0.600* | -0.133 | -0.135 | -1.000* | -1.000 | -0.032* | -0.078* | 1.000* | 1.000 |
| FED | -0.090 | -0.072* | -0.254* | 0.000* | 0.080* | 0.064* | 1.000* | 1.000 | -0.014* | -0.044* | 1.000* | 1.000 |
| FlowScore | - | - | - | - | - | - | - | - | - | - | - | - |
| FBD | - | - | -0.235* | -0.400* | - | - | -1.000* | -1.000 | - | - | -1.000* | -1.000 |

DSTC6 Data

| Metric | Turn-Level P | Turn-Level S | System-Level P | System-Level S |
|---|---|---|---|---|
| BLEU-4 | 0.131 | 0.298 | -0.064* | 0.050* |
| METEOR | 0.307 | 0.323 | 0.633 | 0.084* |
| ROUGE-L | 0.332 | 0.326 | 0.487 | 0.215* |
| ADEM | 0.151 | 0.118 | 0.042* | 0.347* |
| BERTScore | 0.369 | 0.337 | 0.671 | 0.265* |
| BLEURT | 0.326 | 0.294 | 0.213* | 0.426* |
| QuestEval | 0.188 | 0.242 | -0.215* | 0.206* |
| RUBER | 0.114 | 0.092 | -0.074* | 0.104* |
| BERT-RUBER | 0.204 | 0.217 | 0.825 | 0.093* |
| PONE | 0.208 | 0.200 | 0.608 | 0.235* |
| MAUDE | 0.195 | 0.128 | 0.739 | 0.217* |
| DEB | 0.211 | 0.214 | -0.261* | 0.492 |
| GRADE | 0.119 | 0.122 | 0.784 | 0.611 |
| DynaEval | 0.286 | 0.246 | 0.342* | -0.050* |
| USR | 0.184 | 0.166 | 0.432* | 0.147* |
| USL-H | 0.217 | 0.179 | 0.811 | 0.298* |
| DialogRPT | 0.170 | 0.155 | 0.567 | 0.334* |
| Deep AM-FM | 0.326 | 0.295 | 0.817 | 0.674 |
| HolisticEval | 0.001* | -0.004* | 0.010 | -0.002 |
| PredictiveEngage | 0.043 | 0.004* | -0.094* | -0.409* |
| FED | -0.106 | -0.083 | 0.221* | 0.322* |
| FlowScore | 0.064 | 0.095 | 0.352* | 0.362* |
| FBD | - | - | -0.481 | -0.234* |

PredictiveEngage-DailyDialog

| Metric | Turn-Level P | Turn-Level S |
|---|---|---|
| QuestEval | 0.296 | 0.341 |
| MAUDE | 0.104 | 0.060* |
| DEB | 0.516 | 0.580 |
| GRADE | 0.600 | 0.622 |
| DynaEval | 0.167 | 0.160 |
| USR | 0.582 | 0.640 |
| USL-H | 0.688 | 0.699 |
| DialogRPT | 0.489 | 0.533 |
| HolisticEval | 0.368 | 0.365 |
| PredictiveEngage | 0.429 | 0.414 |
| FED | 0.164 | 0.159 |
| FlowScore | - | - |
| FBD | - | - |

HolisticEval-DailyDialog

| Metric | Turn-Level P | Turn-Level S |
|---|---|---|
| QuestEval | 0.285 | 0.260 |
| MAUDE | 0.275 | 0.364 |
| DEB | 0.584 | 0.663 |
| GRADE | 0.678 | 0.697 |
| DynaEval | -0.023* | -0.009* |
| USR | 0.589 | 0.645 |
| USL-H | 0.486 | 0.537 |
| DialogRPT | 0.283 | 0.332 |
| HolisticEval | 0.670 | 0.764 |
| PredictiveEngage | -0.033* | 0.060* |
| FED | 0.485 | 0.507 |
| FlowScore | - | - |
| FBD | - | - |

FED Data

| Metric | Turn-Level P | Turn-Level S | Dialog-Level P | Dialog-Level S |
|---|---|---|---|---|
| QuestEval | 0.037* | 0.093* | -0.032* | 0.080* |
| MAUDE | 0.018* | -0.094* | -0.047* | -0.280 |
| DEB | 0.230 | 0.187 | -0.130* | 0.006* |
| GRADE | 0.134 | 0.118 | -0.034* | -0.065* |
| DynaEval | 0.319 | 0.323 | 0.503 | 0.547 |
| USR | 0.114 | 0.117 | 0.093* | 0.062* |
| USL-H | 0.201 | 0.189 | 0.073* | 0.152* |
| DialogRPT | -0.118 | -0.086* | -0.221 | -0.214 |
| HolisticEval | 0.122 | 0.125 | -0.276 | -0.304 |
| PredictiveEngage | 0.024* | 0.094* | 0.026* | 0.155* |
| FED | 0.120 | 0.095 | 0.222 | 0.320 |
| FlowScore | -0.065* | -0.055* | -0.073* | -0.003* |
| FBD | - | - | - | - |

DSTC9 Data

| Metric | Dialog-Level P | Dialog-Level S | System-Level P | System-Level S |
|---|---|---|---|---|
| QuestEval | 0.026* | 0.043 | 0.604 | 0.527* |
| MAUDE | 0.059 | 0.042* | 0.224* | 0.045* |
| DEB | 0.085 | 0.131 | 0.683 | 0.473* |
| GRADE | -0.078 | -0.070 | -0.674 | -0.482* |
| DynaEval | 0.093 | 0.101 | 0.652 | 0.727 |
| USR | 0.019* | 0.020* | 0.149* | 0.127* |
| USL-H | 0.105 | 0.105 | 0.566* | 0.755 |
| DialogRPT | 0.076 | 0.069 | 0.685 | 0.555* |
| HolisticEval | 0.015* | 0.002* | -0.019* | -0.100* |
| PredictiveEngage | 0.114 | 0.115 | 0.809 | 0.664 |
| FED | 0.128 | 0.120 | 0.559* | 0.391* |
| FlowScore | 0.147 | 0.140 | 0.907 | 0.900 |
| FBD | - | - | -0.669 | -0.627 |

How to Add New Dataset

Let the name of the new dataset be sample.

Create a directory data/sample_data and write a function load_sample_data as follows:

def load_sample_data(base_dir: str):
    '''
    Args:
        base_dir: the absolute path to data/sample_data
    Returns:
        Dict:
        {
            # the required items
            'contexts': List[List[str]],  # dialog contexts, each split by turns, so one context is a List[str]
            'responses': List[str],  # dialog responses
            'references': List[str],  # dialog references; if the data has none, still give a dummy reference such as "NO REF"
            'scores': List[float],  # human scores
            # add any customized items
            'Customized Item': List[str]  # any additional info in the data
        }
    '''

Import the function in gen_data.py, then run python gen_data.py --source_data sample
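A minimal sketch of such a loader is given here for illustration. It assumes a hypothetical file sample.json inside data/sample_data that holds a list of records with the fields context, response, reference and score; neither the file name nor these field names are defined by this repository, so adapt them to the actual dataset.

```python
import json
import os
from typing import Dict, List


def load_sample_data(base_dir: str) -> Dict[str, List]:
    """Load the hypothetical sample.json into the required dict layout."""
    # Assumed record layout (hypothetical, not defined by this repository):
    # {"context": ["turn 1", "turn 2"], "response": "...",
    #  "reference": "...", "score": 4.0}
    with open(os.path.join(base_dir, "sample.json")) as f:
        records = json.load(f)

    data: Dict[str, List] = {
        "contexts": [],
        "responses": [],
        "references": [],
        "scores": [],
    }
    for rec in records:
        data["contexts"].append(list(rec["context"]))
        data["responses"].append(rec["response"])
        # Use the dummy reference when the record has none.
        data["references"].append(rec.get("reference") or "NO REF")
        data["scores"].append(float(rec["score"]))

    # Every required list must have one entry per sample.
    assert len({len(v) for v in data.values()}) <= 1
    return data
```

The final assertion is a cheap sanity check that the four required lists stay aligned, which is useful when the source file has missing fields.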

How to Add New Metrics

Let the name of the new metric be metric.

Write a function gen_metric_data to transform and generate the data into the metric directory:

from typing import Dict

# input format 1
def gen_metric_data(data: Dict, output_path: str):
    '''
    Args:
        data: the return value of load_data functions e.g. {'contexts': ...}
        output_path: path to the output file
    '''

# input format 2
def gen_metric_data(data: Dict, base_dir: str, dataset: str):
    '''
    Args:
        data: the return value of load_data functions e.g. {'contexts': ...}
        base_dir: path to the output directory
        dataset: name of the dataset
    '''

Two input formats are supported. Use whichever is easier for your metric.

Import the function in gen_data.py and follow comments in the code to add the metric.
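As a rough sketch of input format 1, the function below writes one JSON object per sample to output_path. The JSON-lines layout and the field names context, response and reference are assumptions made for illustration; a real metric should be given whatever input format its own code reads.

```python
import json
from typing import Dict


def gen_metric_data(data: Dict, output_path: str) -> None:
    """Write one JSON object per sample (input format 1).

    The JSON-lines layout is an assumption for illustration only.
    """
    with open(output_path, "w") as f:
        for context, response, reference in zip(
            data["contexts"], data["responses"], data["references"]
        ):
            sample = {
                "context": context,      # List[str], one entry per turn
                "response": response,    # str
                "reference": reference,  # str, possibly the dummy "NO REF"
            }
            f.write(json.dumps(sample) + "\n")
```

Human scores are left out of the generated file on purpose in this sketch, since the metric only needs its inputs; the scores remain available from the data loader for later analysis.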

Then write a function read_metric_result to read the prediction of the metric:

def read_metric_result(data_path: str):
    '''
    Args:
        data_path: path to the prediction file
    
    Return:
        # You can choose to return list or dict
        List: metric scores e.g. [0.2, 0.3, 0.4, ...]
        or 
        Dict: {'metric': List # metric scores}
    '''

Import the function in read_result.py and follow comments in the code to add the metric.
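A reader for the simplest case might look like the sketch below, named read_metric_result as in the text above. It assumes the metric writes a plain text file with one score per line, which is a hypothetical layout; metrics that emit JSON, CSV or logs need their own parsing.

```python
from typing import List


def read_metric_result(data_path: str) -> List[float]:
    """Read one score per line, skipping blank lines (assumed file layout)."""
    scores = []
    with open(data_path) as f:
        for line in f:
            line = line.strip()
            if line:
                scores.append(float(line))
    return scores
```

Returning a plain list corresponds to the first of the two allowed return types; wrapping it as {'metric': scores} gives the dict form.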

Then just follow the previous evaluation instructions to evaluate the metric.