KPEval 🛠️

KPEval is a toolkit for evaluating your keyphrase systems. 🎯

[Paper] (ACL 2024 Findings) [Tweet]

We provide semantic-based metrics for four evaluation aspects:

  • 🤝 Reference Agreement: evaluating the extent to which keyphrase predictions align with human-annotated references.
  • 📚 Faithfulness: evaluating whether each keyphrase prediction is semantically grounded in the input.
  • 🌈 Diversity: evaluating whether the predictions include diverse keyphrases with minimal repetitions.
  • 🔍 Utility: evaluating the potential of the predictions to enhance document indexing for improved information retrieval performance.

If you have any questions or suggestions, please submit an issue. Thank you!

News 📰

  • [2024/02] 🚀 We have released the KPEval toolkit.
  • [2023/05] 🌟 The phrase embedding model is now available at uclanlp/keyphrase-mpnet-v1.

Getting Started

We recommend setting up a conda environment:

conda create -n kpeval python=3.8
conda activate kpeval

Then, install the required packages:

  • Install torch. Example command if you use CUDA GPUs on Linux:

    pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
    
  • pip install -r requirements.txt
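
Optionally, you can verify that the CUDA build of torch was installed correctly:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"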

We provide the outputs obtained from 21 keyphrase models at this link. Run tar -xzvf kpeval_model_outputs.tar.gz to uncompress the archive. If the link expires, please email diwu@cs.ucla.edu or open an issue.
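
For reference, after uncompressing, the per-model folders should line up with the model_outputs/[dataset]/[model_id]/ layout described in the customization section below (the exact top-level directory inside the archive is an assumption; check the extracted contents):

tar -xzvf kpeval_model_outputs.tar.gz
ls model_outputs/kp20k/8_catseq/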

Running Evaluation

All evaluation aspects are integrated into the run_evaluation.py script, and we provide a bash wrapper around it. Simply run:

bash run_evaluation.sh [dataset] [model_id] [metric_id]

For example:

bash run_evaluation.sh kp20k 8_catseq semantic_matching

Two log files containing the evaluation results and the per-document scores will be saved to eval_results/[dataset]/[model_id]/. Please see below for the metric_id corresponding to various metrics.
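
To compute several metrics for the same system in one pass, you can loop over the metric IDs from the tables below, e.g.:

for metric_id in semantic_matching unieval diversity; do
    bash run_evaluation.sh kp20k 8_catseq $metric_id
done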

Supported Metrics 📊

The major metrics supported here are the ones introduced in the KPEval paper.

| aspect | metric | metric_id | result_field |
| --- | --- | --- | --- |
| reference agreement | SemF1 | semantic_matching | semantic_f1 |
| faithfulness | UniEval | unieval | faithfulness-summ |
| diversity | dup_token_ratio | diversity | dup_token_ratio |
| diversity | emb_sim | diversity | self_embed_similarity_sbert |
| utility | Recall@5 | retrieval | sparse/dense_recall_at_5 |
| utility | RR@5 | retrieval | sparse/dense_mrr_at_5 |

metric_id is the argument to provide to the evaluation script, and result_field is the field in the result file where the metric's results are stored.

Note: to evaluate utility, you need to prepare the training data using DeepKPG and update the config to point to the corpus.
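
As a rough illustration, the config update is a gin binding that points the retrieval metric at the prepared corpus. The binding name and path below are hypothetical, so copy the actual parameter names from configs/sample_config_[dataset].gin:

# hypothetical binding -- use the real parameter name from the sample configs
RetrievalMetric.corpus_path = 'path/to/deepkpg/kp20k_training_corpus.json'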

In addition, we support the following metrics from various previous work:

| aspect | metric | metric_id | result_field |
| --- | --- | --- | --- |
| reference agreement | F1@5 | exact_matching | micro/macro_avg_f1@5 |
| reference agreement | F1@M | exact_matching | micro/macro_avg_f1@M |
| reference agreement | F1@O | exact_matching | micro/macro_avg_f1@O |
| reference agreement | MAP | exact_matching | MAP@M |
| reference agreement | NDCG | exact_matching | avg_NDCG@M |
| reference agreement | alpha-NDCG | exact_matching | AlphaNDCG@M |
| reference agreement | R-Precision | approximate_matching | present/absent/all_r-precision |
| reference agreement | FG | fg | fg_score |
| reference agreement | BertScore | bertscore | bert_score_[model]_all_f1 |
| reference agreement | MoverScore | moverscore | mover_score_all |
| reference agreement | ROUGE | rouge | present/absent/all_rouge-l_f |
| diversity | Unique phrase ratio | diversity | unique_phrase_ratio |
| diversity | Unique token ratio | diversity | unique_token_ratio |
| diversity | SelfBLEU | diversity | self_bleu |
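
These metrics run through the same entry point. For example, to compute the exact-match F1 scores (F1@5, F1@M, F1@O) for the system used above:

bash run_evaluation.sh kp20k 8_catseq exact_matching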

Using your own models, datasets, or metrics 🛠️

  • New dataset: create a config file at configs/sample_config_[dataset].gin.
  • New model: store your model's outputs at model_outputs/[dataset]/[model_id]/[dataset]_hypotheses_linked.json. The file should be in jsonl format, with each line containing three fields: source, target, and prediction (see the example after this list). If you are conducting reference-free evaluation, you may use a placeholder in the target field.
  • New metric: implement it in a new file in the metrics folder, with a metric class that inherits from KeyphraseMetric (see the sketch after this list). Make sure you update metrics/__init__.py and run_evaluation.py, and add the parameters for your new metric to the config file in configs.
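
For reference, a line of [dataset]_hypotheses_linked.json would look roughly as follows; the texts are made up, and whether keyphrases are stored as lists or as delimited strings should be copied from the released model outputs rather than from this sketch:

{"source": "a document about keyphrase evaluation ...", "target": ["keyphrase evaluation", "semantic matching"], "prediction": ["keyphrase evaluation", "toolkit"]}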

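A minimal sketch of a custom metric file, assuming a per-document scoring interface; the import path and method name here are hypothetical, so mirror one of the existing classes in the metrics folder for the real KeyphraseMetric interface:

# metrics/avg_phrase_length.py -- hypothetical example
from metrics.keyphrase_metric import KeyphraseMetric  # import path is an assumption

class AvgPhraseLengthMetric(KeyphraseMetric):
    """Toy metric: average number of tokens per predicted keyphrase."""

    def score_document(self, source, target, prediction):
        # hypothetical hook; `prediction` is assumed to be a list of keyphrase strings
        lengths = [len(p.split()) for p in prediction]
        return {"avg_phrase_length": sum(lengths) / max(len(lengths), 1)}
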
Citation

If you find this toolkit useful, please consider citing the following paper.

@inproceedings{wu-etal-2024-kpeval,
    title = "{KPE}val: Towards Fine-Grained Semantic-Based Keyphrase Evaluation",
    author = "Wu, Di  and
      Yin, Da  and
      Chang, Kai-Wei",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.117",
    pages = "1959--1981",
}