tjunlp-lab/M3KE

Can you provide some details of the evaluation code for the reported results?


This benchmark also appears to consist of multiple-choice tasks similar to MMLU, but there are many implementation details that differ from MMLU's practice, and these details have a significant impact on the final accuracy numbers. For example, the official MMLU accuracy calculation normalizes the probabilities over the four options and selects the option with the highest probability as the prediction. However, many others have instead switched to a generation-based setup and extract the A/B/C/D option from the generated answer, where both the prompt format and the answer-extraction method affect the final result.
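For reference, the MMLU-style scoring mentioned above could look roughly like the sketch below. This is only an illustration, assuming a Hugging Face causal LM; the model name, prompt format, and option tokenization are placeholders, not the authors' actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM scored this way works the same.
MODEL_NAME = "bigscience/bloom-7b1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mmlu_style_predict(prompt: str) -> str:
    """Return the option letter with the highest next-token probability,
    normalizing over the four options only (MMLU-style scoring)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Token ids for " A", " B", " C", " D" (tokenization may differ per model).
    option_ids = [
        tokenizer(f" {c}", add_special_tokens=False).input_ids[-1] for c in "ABCD"
    ]
    option_probs = torch.softmax(next_token_logits[option_ids], dim=-1)
    return "ABCD"[int(option_probs.argmax())]
```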

So what are the details of the evaluation code used to produce the results table reported in your repository?

Thank you for your interest in the M3KE dataset. As you mentioned, different evaluation methods can lead to significant differences in experimental results. We first attempted to select the label with the highest probability among the four options as the final answer. However, we found that LLMs such as BLOOM-7b1 typically choose only a single label regardless of the question, in both zero- and few-shot scenarios. As a result, we decided to extract A/B/C/D from the model generation and keep the maximum generation length as short as possible. If more than one label is present in the generation, we consider the model's answer incorrect.
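The exact evaluation script is not shown in this thread, so the following is only a minimal sketch of the extraction rule described above; the function names and the regex are assumptions, not the authors' code.

```python
import re

def extract_choice(generation: str):
    """Return the single option letter found in the generation, or None.
    If zero or more than one distinct label appears, the answer counts as wrong."""
    labels = set(re.findall(r"[ABCD]", generation))
    return labels.pop() if len(labels) == 1 else None

def is_correct(generation: str, gold_label: str) -> bool:
    return extract_choice(generation) == gold_label

# The generation itself would be produced with a very small generation budget
# (e.g. max_new_tokens of a few tokens) so the model emits little beyond the label.
print(is_correct("B", "B"))        # True
print(is_correct("A or B", "B"))   # False: more than one label present
```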

We plan to make the questions of the M3KE dataset publicly available before the end of June. We would greatly appreciate it if you used the M3KE dataset.