KPEval is a toolkit for evaluating your keyphrase systems. 🎯
[Paper] (ACL 2024 Findings) [Tweet]
We provide semantic-based metrics for four evaluation aspects:
- 🤝 Reference Agreement: evaluating the extent to which keyphrase predictions align with human-annotated references.
- 📚 Faithfulness: evaluating whether each keyphrase prediction is semantically grounded in the input.
- 🌈 Diversity: evaluating whether the predictions include diverse keyphrases with minimal repetitions.
- 🔍 Utility: evaluating the potential of the predictions to enhance document indexing for improved information retrieval performance.
If you have any questions or suggestions, please submit an issue. Thank you!
- [2024/02] 🚀 We have released the KPEval toolkit.
- [2023/05] 🌟 The phrase embedding model is now available at uclanlp/keyphrase-mpnet-v1.
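If you want to use the phrase embedding model on its own, here is a minimal sketch, assuming the checkpoint loads through the sentence-transformers library like other mpnet-based models:

```python
# Sketch: embed keyphrases with the released phrase embedding model.
# Assumes the checkpoint works with the sentence-transformers library.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("uclanlp/keyphrase-mpnet-v1")
phrases = ["neural keyphrase generation", "keyphrase extraction"]
embeddings = model.encode(phrases, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))  # cosine similarity between the two phrases
```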
We recommend setting up a conda environment:
conda create -n kpeval python=3.8
conda activate kpeval
Then, install the required packages:
- Install torch. Example command if you use CUDA GPUs on Linux:
  pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
- Install the remaining requirements:
  pip install -r requirements.txt
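To confirm that the torch install can see your GPU before running the GPU-based metrics, an optional sanity check:

```python
# Optional sanity check for the environment.
import torch

print(torch.__version__)          # e.g., 1.13.1+cu117 with the command above
print(torch.cuda.is_available())  # True if the CUDA build found a GPU
```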
We provide the outputs obtained from 21 keyphrase models in this link. Please run tar -xzvf kpeval_model_outputs.tar.gz to uncompress the archive. Please email diwu@cs.ucla.edu or open an issue if the link expires.
All of the evaluation aspects are integrated into the run_evaluation.py script. We provide a bash wrapper to run it:
bash run_evaluation.sh [dataset] [model_id] [metric_id]
For example:
bash run_evaluation.sh kp20k 8_catseq semantic_matching
Two log files, containing the evaluation results and the per-document scores, will be saved to eval_results/[dataset]/[model_id]/. Please see the tables below for the metric_id corresponding to each metric.
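If you want to sweep several metrics for one model, a small driver like the following simply calls the bash wrapper above in a loop; the dataset, model, and metric ids are the same ones you would pass on the command line:

```python
# Run several metrics for one model by calling the bash wrapper repeatedly.
import subprocess

dataset = "kp20k"
model_id = "8_catseq"
metric_ids = ["semantic_matching", "exact_matching", "diversity"]

for metric_id in metric_ids:
    # Equivalent to: bash run_evaluation.sh kp20k 8_catseq <metric_id>
    subprocess.run(["bash", "run_evaluation.sh", dataset, model_id, metric_id], check=True)
```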
The major metrics supported here are the ones introduced in the KPEval paper.
aspect | metric | metric_id | result_field |
---|---|---|---|
reference agreement | SemF1 | semantic_matching | semantic_f1 |
faithfulness | UniEval | unieval | faithfulness-summ |
diversity | dup_token_ratio | diversity | dup_token_ratio |
diversity | emb_sim | diversity | self_embed_similarity_sbert |
utility | Recall@5 | retrieval | sparse/dense_recall_at_5 |
utility | RR@5 | retrieval | sparse/dense_mrr_at_5 |
metric_id is the argument to provide to the evaluation script, and result_field is the field in the result file where the metric's results are stored.
Note: to evaluate utility, you need to prepare the training data using DeepKPG and update the config to point to the corpus.
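The exact names and layout of the result files under eval_results/ may differ from this sketch, so treat it as an assumption to check against the script's actual output: assuming the aggregated results are stored as JSON, the snippet below pulls out a single result_field (e.g., semantic_f1).

```python
# Sketch only: assumes the aggregated results are saved as JSON files under
# eval_results/[dataset]/[model_id]/ -- check the actual file names the script produces.
import glob
import json

result_field = "semantic_f1"  # see the result_field column above

for path in glob.glob("eval_results/kp20k/8_catseq/*.json"):
    with open(path) as f:
        results = json.load(f)
    if result_field in results:
        print(path, results[result_field])
```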
In addition, we support the following metrics from various previous work:
aspect | metric | metric_id | result_field |
---|---|---|---|
reference agreement | F1@5 | exact_matching | micro/macro_avg_f1@5 |
reference agreement | F1@M | exact_matching | micro/macro_avg_f1@M |
reference agreement | F1@O | exact_matching | micro/macro_avg_f1@O |
reference agreement | MAP | exact_matching | MAP@M |
reference agreement | NDCG | exact_matching | avg_NDCG@M |
reference agreement | alpha-NDCG | exact_matching | AlphaNDCG@M |
reference agreement | R-Precision | approximate_matching | present/absent/all_r-precision |
reference agreement | FG | fg | fg_score |
reference agreement | BertScore | bertscore | bert_score_[model]_all_f1 |
reference agreement | MoverScore | moverscore | mover_score_all |
reference agreement | ROUGE | rouge | present/absent/all_rouge-l_f |
diversity | Unique phrase ratio | diversity | unique_phrase_ratio |
diversity | Unique token ratio | diversity | unique_token_ratio |
diversity | SelfBLEU | diversity | self_bleu |
- New dataset: create a config file at configs/sample_config_[dataset].gin.
- New model: store your model's outputs at model_outputs/[dataset]/[model_id]/[dataset]_hypotheses_linked.json. The file should be in jsonl format containing three fields: source, target, and prediction. If you are conducting reference-free evaluation, you may use a placeholder in the target field. A sketch of writing this file is given after this list.
- New metric: implement it in a new file in the metrics folder. The metric class should inherit KeyphraseMetric. Make sure you update metrics/__init__.py and run_evaluation.py, and also update the config file in configs with the parameters for your new metric. A skeleton is sketched after this list.
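For the model output format, here is a minimal sketch of writing [dataset]_hypotheses_linked.json in jsonl form; the my_model id is just a placeholder, and how keyphrases are joined inside target and prediction (a list vs. a separator-joined string) is an assumption here, so mirror the released model outputs:

```python
# Minimal sketch: one json object per line with the three required fields.
# NOTE: the keyphrase separator used in "target"/"prediction" below is an assumption;
# follow the format of the released model outputs.
import json
import os

examples = [
    {
        "source": "Document text goes here ...",
        "target": "reference phrase one;reference phrase two",   # placeholder is fine for reference-free evaluation
        "prediction": "predicted phrase one;predicted phrase two",
    },
]

os.makedirs("model_outputs/kp20k/my_model", exist_ok=True)
with open("model_outputs/kp20k/my_model/kp20k_hypotheses_linked.json", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

For a new metric, an illustrative skeleton is below; the import path and method signature are assumptions, not the actual KeyphraseMetric interface, so copy the structure of an existing file in metrics/ rather than this verbatim:

```python
# Illustrative only: the base-class interface shown here is an assumption.
# Use an existing metric in the metrics/ folder as the real template.
from metrics import KeyphraseMetric  # hypothetical import path


class AveragePhraseLength(KeyphraseMetric):
    """Toy metric: average number of tokens per predicted keyphrase."""

    def score(self, predictions, references):  # hypothetical signature
        lengths = [len(p.split()) for p in predictions] or [0]
        return {"avg_phrase_length": sum(lengths) / len(lengths)}
```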
If you find this toolkit useful, please consider citing the following paper.
@inproceedings{wu-etal-2024-kpeval,
title = "{KPE}val: Towards Fine-Grained Semantic-Based Keyphrase Evaluation",
author = "Wu, Di and
Yin, Da and
Chang, Kai-Wei",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand and virtual meeting",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.117",
pages = "1959--1981",
}