Are LLM-based Evaluators Confusing NLG Quality Criteria?

This is the official repository for our ACL 2024 paper Are LLM-based Evaluators Confusing NLG Quality Criteria?

We release the following data and code used in our work:

  • Aspect criteria (including the different descriptions): aspect_criteria.json
  • Prompts generated for LLM-based evaluation: eval_prompt.py
  • Prompts (including the examples and instructions) and rule-based code for constructing the perturbations: Perturbations/
  • Data for the experiments (including the refined references, perturbed texts, and other information): data_all.json (see the loading sketch after this list)
  • Experimental results for three LLMs (including the average rating for each test item): Eval_results/
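
As a quick-start sketch, the snippet below loads aspect_criteria.json and data_all.json and reports their top-level structure. The file names come from the list above, but their internal schema is not documented in this section, so the code deliberately avoids assuming any particular keys.

import json

def describe(name, obj):
    # Report the type and size of a loaded JSON object without
    # assuming anything about its internal schema.
    if isinstance(obj, dict):
        print(f"{name}: dict with {len(obj)} keys, e.g. {list(obj)[:5]}")
    elif isinstance(obj, list):
        print(f"{name}: list with {len(obj)} items")
    else:
        print(f"{name}: {type(obj).__name__}")

for path in ("aspect_criteria.json", "data_all.json"):
    with open(path, encoding="utf-8") as f:
        describe(path, json.load(f))

Once the top-level keys are known, the same pattern extends to inspecting the per-criterion descriptions and the per-item perturbation records.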

Citation

@article{hu2024llm,
  title={Are LLM-based Evaluators Confusing NLG Quality Criteria?},
  author={Hu, Xinyu and Gao, Mingqi and Hu, Sen and Zhang, Yang and Chen, Yicheng and Xu, Teng and Wan, Xiaojun},
  journal={arXiv preprint arXiv:2402.12055},
  year={2024}
}