Duplicates in test split
Closed this issue · 1 comments
Pupy101 commented
Hello, can you help me? There are 159 questions that appear more than once in the test split. Here is the code I used to find them:
from collections import defaultdict

import datasets

test = datasets.load_dataset("TIGER-Lab/MMLU-Pro", split="test")

mapping = defaultdict(int)
for item in test:
    mapping[(item["category"], item["question"], "".join(item["options"]), item["answer"])] += 1

count_doubles = 0
for (category, question, *_), count in mapping.items():
    if count > 1:
        print(category, repr(question))
        count_doubles += 1

print(count_doubles)
Wyyyb commented
Thank you for pointing out these duplicates. They will have minimal impact on the evaluation results, so we have decided not to remove them for the time being, in order to keep the size of the test split consistent.
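Since the duplicates stay in the published split, anyone who wants a duplicate-free evaluation can filter them locally. Below is a minimal sketch of one way to do that, not something the maintainers provide. The helper names `question_key` and `first_occurrence_indices` are made up for illustration, and the toy rows are invented data rather than real MMLU-Pro entries. The key uses the same four fields as the snippet above, but stores the options as a tuple instead of a joined string so that option boundaries are preserved.

```python
from typing import Any, Hashable, Iterable, Mapping


def question_key(item: Mapping[str, Any]) -> Hashable:
    """Identity key built from the same four fields as the snippet in the issue."""
    # tuple(...) keeps option boundaries intact, unlike "".join(...)
    return (item["category"], item["question"], tuple(item["options"]), item["answer"])


def first_occurrence_indices(rows: Iterable[Mapping[str, Any]]) -> list[int]:
    """Return the indices of rows whose key has not been seen earlier."""
    seen = set()
    keep = []
    for index, item in enumerate(rows):
        key = question_key(item)
        if key not in seen:
            seen.add(key)
            keep.append(index)
    return keep


# Toy rows standing in for the real split (made-up data, not from MMLU-Pro).
rows = [
    {"category": "math", "question": "2 + 2 = ?", "options": ["3", "4"], "answer": "B"},
    {"category": "math", "question": "2 + 2 = ?", "options": ["3", "4"], "answer": "B"},
    {"category": "law", "question": "Q?", "options": ["x", "y"], "answer": "A"},
]
keep = first_occurrence_indices(rows)
print(keep)  # [0, 2]
```

With the split loaded as in the original snippet, something like `test.select(first_occurrence_indices(test))` should yield a deduplicated copy, assuming the installed `datasets` version supports `Dataset.select` with a list of indices.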