
TRUE

Note: This is not an officially supported Google product.

This repository includes code and data accompanying our NAACL 2022 paper "TRUE: Re-evaluating Factual Consistency Evaluation".

Model

We have open-sourced an NLI model based on this work; it is available on HuggingFace.

Data

We provide a script that downloads all 11 datasets used in TRUE and converts them to a standardized binary scheme. To download the datasets, first download the nli-fever directory and extract it. After extraction, the working directory should contain an nli_fever directory.

Note: The SummEval dataset requires pairing summaries with the original CNN/DailyMail articles. Downloading these (via the HuggingFace Datasets library) might take a while.
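
For reference, the CNN/DailyMail articles can be fetched with the HuggingFace Datasets library roughly as follows (a minimal sketch, not the repository's own code; "3.0.0" is the standard non-anonymized configuration):

from datasets import load_dataset

# This triggers the large CNN/DailyMail download, which is the slow step
# mentioned above. Each record has "article", "highlights" and "id" fields.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")
print(cnn_dm[0]["article"][:200])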

To download and standardize the datasets for TRUE, run:

python true/src/download_datasets.py

To compute the TRUE scores, use the "grounding", "generated_text", and "label" columns in the resulting CSV files. The "label" column contains the binary label to evaluate against.
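
For example, a new metric can be scored against the standardized files roughly as follows (a minimal sketch: the file names are placeholders, and the toy lexical-overlap function merely stands in for a real factual consistency metric):

import pandas as pd

def toy_overlap_metric(grounding: str, generated: str) -> float:
    # Placeholder metric: fraction of generated-text tokens that appear in the grounding.
    gen_tokens = str(generated).lower().split()
    if not gen_tokens:
        return 0.0
    grounding_tokens = set(str(grounding).lower().split())
    return sum(t in grounding_tokens for t in gen_tokens) / len(gen_tokens)

df = pd.read_csv("data/summeval.csv")  # placeholder file name; use a CSV written by download_datasets.py
df["toy_overlap"] = [
    toy_overlap_metric(g, h) for g, h in zip(df["grounding"], df["generated_text"])
]
df.to_csv("data/summeval_with_scores.csv", index=False)  # ready for the meta-evaluation step below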

Usage

To compute meta-evaluation scores for new metrics, you need a CSV file with one column per metric and a "label" column. Metric scores should be in the range [0, 1].

For example:

     ...  label  metric_1  metric_2    metric_3
0    ...  0      0.5       0.666666    0.42
1    ...  0      0.2       1           0.75
2    ...  1      0.8       0.5         0

Then run:

python true/src/meta_evaluation_scores.py \
       --input_path=data/"{INPUT_FILE}" \
       --metrics_scores_columns="{METRICS}" \
       --output_path="{OUTPUT_PATH}"

metrics_scores_columns should be a comma-separated list of metrics to test, e.g.

metrics_scores_columns="metric_1,metric_2,metric_3"

The output file is a meta-evaluation report containing ROC AUC scores, as well as accuracy and grounded/ungrounded precision and recall. Note that, unlike ROC AUC, all other measures require threshold tuning; they use a default threshold of 0.5, which is not necessarily suitable for every metric. To pass custom thresholds, use --thresholds.
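
As a rough illustration of what these measures are (a hedged sketch using scikit-learn, not the repository's implementation; the input file and column names are placeholders):

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

df = pd.read_csv("data/my_metrics.csv")  # placeholder input file
y_true = df["label"]
scores = df["metric_1"]  # one of the columns passed via --metrics_scores_columns

# ROC AUC is threshold-free.
print("ROC AUC:", roc_auc_score(y_true, scores))

# Accuracy and precision/recall depend on the chosen threshold.
threshold = 0.5  # the default; override with --thresholds when it does not suit a metric
y_pred = (scores >= threshold).astype(int)
print("accuracy:", accuracy_score(y_true, y_pred))
print("grounded precision:", precision_score(y_true, y_pred, pos_label=1))
print("grounded recall:", recall_score(y_true, y_pred, pos_label=1))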

Similarly, to create the ROC plot, run:

python true/src/meta_evaluation_plots.py \
       --input_path=data/"{INPUT_FILE}" \
       --metrics_scores_columns="{METRICS}" \
       --output_path="{OUTPUT_PATH}"
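
The plot shows one ROC curve per metric. A hedged sketch of how such a plot could be produced outside the script (placeholder file and column names, not the repository's plotting code):

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

df = pd.read_csv("data/my_metrics.csv")  # placeholder input file
for column in ["metric_1", "metric_2", "metric_3"]:  # placeholder metric columns
    fpr, tpr, _ = roc_curve(df["label"], df[column])
    auc = roc_auc_score(df["label"], df[column])
    plt.plot(fpr, tpr, label=f"{column} (AUC={auc:.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc_plot.png")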