This repository consists of a benchmark of various text similarity measures for Natural Language Generation (NLG) evaluation, on multiple datasets.
Measuring the similarity between two texts is a fundamental task in Natural Language Processing (NLP). It is used in many applications, such as machine translation, text summarization, and NLG evaluation. As it is very hard to formally define what a good similarity should be, we use the common gold standard of correlation with human judgement.
In this repository, we provide a Python module to easily compute these measures on specific datasets that are particularly relevant for NLG evaluation. Indeed, these datasets contain pairs of texts that are similar, and are annotated by humans to quantify the degree of similarity.
We also provide a way to aggregate these measures through Kemeny ranking consensus [WIP]
We used WMT2016, WMT2017. We provide a downloading script for WMT2016, and data can be found here
We used the following similarity measures:
- BLEU
- CHRF
- METEOR
- ROUGE
- TER
- BERTScore
- BaryScore
- DepthScore
- InfoLM
- SacreBLEU
We provide results for the three classical correlation measures:
- Pearson
- Spearman
- Kendall
Clone the repository:
git clone https://github.com/greg2451/aggregating-text-similarity-metrics.git
-
Install conda locally following this link. We recommend miniconda, it is more lightweight and will be sufficient for our usage.
-
Create a new conda environment by executing the following command in your terminal:
conda create -n aggregating_text_similarity_metrics python=3.10 conda activate aggregating_text_similarity_metrics conda install pip
-
Having the conda environnement activated, install the requirements:
pip install -r requirements.txt
To run the benchmark, you can use the jupyter notebook or run the python file:
export PYTHONPATH=$PYTHONPATH:$(pwd)
python aggregating_text_similarity_metrics/run_wmt_experiment.py
Note that you need to have the
PYTHONPATH
set to the root of the repository, so that theaggregating_text_similarity_metrics
module can be found.
The expected output is to create a folder results
containing the results of the benchmark on all the WMT data we have, for all the similarity measures.
This folder will also contain the correlation results for all the similarity measures.
You can also install the package locally and run the benchmark:
pip install -e .
python -m aggregating_text_similarity_metrics.run_wmt_experiment
This section should only be necessary if you want to contribute to the project.
Start by installing the dev-requirements:
pip install -r dev-requirements.txt
Run the following command at the root of the repository:
pre-commit install