ErrorAnalysis Prompt for MT Evaluation in ChatGPT

Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT. (Full report)

This repository releases the test sets and the scores produced by text-davinci-003 and ChatGPT, so that the study can be replicated.

Abstract

Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks such as machine translation, question answering, text summarization, and natural language understanding. Recent research (Kocmi and Federmann, 2023) has shown that utilizing ChatGPT for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conducted an investigation into several prompting methods. Our results indicate that by combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2022), a new prompting method called Error Analysis Prompting, LLMs like ChatGPT can generate human-like MT evaluations at both the system and segment level. Additionally, we discovered some limitations of ChatGPT as an MT evaluator, such as unstable scoring and biases when provided with multiple translations in a single query. Our findings aim to provide a preliminary experience for appropriately evaluating translation quality on ChatGPT while offering a variety of tricks in designing prompts for in-context learning. We anticipate that this report will shed new light on advancing the field of translation evaluation with LLMs by enhancing both accuracy and reliability of metrics.

Data and Evaluations

For each language pair, we divide the segments from the WMT20 test set into four groups based on the number of tokens they contain (15-24, 25-34, 35-44, 45-54). We randomly sample 10 segments from each group to form a new dataset of 40 segments. We use Multidimensional Quality Metrics (MQM) annotations as the human evaluation. The test data and the corresponding evaluation scores are available in "./data".
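The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name `sample_by_length`, the whitespace tokenizer, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Token-count buckets from the text (inclusive bounds).
BUCKETS = [(15, 24), (25, 34), (35, 44), (45, 54)]


def sample_by_length(segments, per_bucket=10, seed=0, count_tokens=None):
    """Pick `per_bucket` random segments from each token-length bucket."""
    if count_tokens is None:
        # Assumption: whitespace tokenization; the study's tokenizer may differ.
        count_tokens = lambda text: len(text.split())
    rng = random.Random(seed)

    # Group segments by the bucket their token count falls into.
    grouped = defaultdict(list)
    for seg in segments:
        n = count_tokens(seg)
        for low, high in BUCKETS:
            if low <= n <= high:
                grouped[(low, high)].append(seg)
                break  # segments outside every bucket are simply skipped

    # Draw the same number of segments from every bucket.
    sampled = []
    for bucket in BUCKETS:
        pool = grouped[bucket]
        if len(pool) < per_bucket:
            raise ValueError(f"bucket {bucket} has only {len(pool)} segments")
        sampled.extend(rng.sample(pool, per_bucket))
    return sampled
```

With the default `per_bucket=10`, the four buckets yield the 40 segments per language pair mentioned above.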

The task statistics are shown as follows:

An overview of Error Analysis Prompting

An overview of our error analysis prompting. Detailed prompt contexts can be obtained in "./prompts".

Results and Findings

🙂 Our EA Prompting outperforms standard prompting at the segment level, achieving human-like evaluations at both the system level and segment level.

System & Segment level performance on our testset:

🤔 When designing prompts, itemized responses work better than lengthy, detailed explanations of errors. Moreover, splitting the instruction into two steps, identifying errors and then scoring the translation, can improve evaluation stability.
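As a rough sketch of the two-step design, the instruction can be issued as two consecutive turns of one conversation. The wording below is purely illustrative and is not the prompt text used in the study (see "./prompts" for that); `build_two_step_prompts` is a hypothetical helper name.

```python
def build_two_step_prompts(source, reference, translation):
    """Return (identify_prompt, score_prompt) for a two-turn conversation."""
    # Step 1: ask only for the errors, as an itemized list.
    # Illustrative wording; not the prompts shipped in "./prompts".
    identify_prompt = (
        f"Source: {source}\n"
        f"Reference: {reference}\n"
        f"Translation: {translation}\n"
        "Based on the source and the reference, identify the errors in the "
        "translation. Answer with an itemized list, one error per line."
    )
    # Step 2: sent as a follow-up turn, so the model scores from its own list.
    score_prompt = (
        "Based on the errors you listed, give a score for the translation."
    )
    return identify_prompt, score_prompt
```

Keeping the scoring request in a separate turn means the model has already committed to a list of errors before it produces a score.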

A comparison of different prompt designs and their prompt contexts:

😐 On text-davinci-003, the performance boost from EA prompting is observed in the zero-shot scenario rather than in the few-shot scenario, which indicates that the settings need to be adjusted when using other GPT models.

❗ Despite its good performance, we show that ChatGPT is NOT a stable evaluator: it may assign different scores to the same translation across queries.
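One simple way to observe this instability is to score the same translation several times and look at the spread. The sketch below assumes a caller-supplied `score_fn` (hypothetical) that sends one query and returns a numeric score; it does not call any real API.

```python
import statistics


def score_spread(score_fn, translation, repeats=5):
    """Score one translation `repeats` times and summarize the variation."""
    # `score_fn` is a placeholder for whatever function queries the model.
    scores = [score_fn(translation) for _ in range(repeats)]
    return {
        "scores": scores,
        "range": max(scores) - min(scores),
        "stdev": statistics.pstdev(scores),  # population standard deviation
    }
```

A range of zero across repeats would indicate stable scoring for that segment; a nonzero range suggests averaging over several queries or reporting the spread alongside the score.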

❗ It is NOT advisable to combine multiple translations into a single query, as ChatGPT tends to prefer the translations that appear earlier in the input.
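In practice this means building one query per candidate translation rather than one query listing all of them. A minimal sketch, with illustrative prompt wording and a hypothetical helper name `queries_for_translations`:

```python
def queries_for_translations(source, reference, translations):
    """Build one independent prompt per candidate translation."""
    prompts = []
    for translation in translations:
        # Each prompt contains exactly one translation, so its position
        # relative to other candidates cannot influence the evaluation.
        prompts.append(
            f"Source: {source}\n"
            f"Reference: {reference}\n"
            f"Translation: {translation}\n"
            "Identify the errors in the translation."
        )
    return prompts
```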

Please refer to our full report & arXiv preprint for more details.

Citation

If you find this work helpful, please consider citing as follows:

@article{Lu2023EAPrompt,
  title={Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT},
  author={Lu, Qingyu and Qiu, Baopu and Ding, Liang and Zhang, Kanjian and Kocmi, Tom and Tao, Dacheng},
  journal={arXiv preprint},
  url={https://arxiv.org/pdf/2303.13809.pdf},
  year={2023}
}
