huggingface/cosmopedia

questions about evaluation like MMLU


Thank you for sharing.

Benchmarks like MMLU are typically evaluated in a 5-shot setting to measure a model's in-context learning capabilities.

Can you explain why your MMLU evaluations instead use a zero-shot setting with the answer options included in the prompt?

According to your blog, in this setup your model's MMLU scores are higher than those of the Qwen 1.5B and Phi models, whereas under 5-shot evaluation the conclusion is the opposite. Is this situation reasonable? Thank you.
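For concreteness, here is a minimal sketch (not the repo's actual evaluation code, and with a made-up question) of how the two prompt formats being asked about differ: zero-shot with the answer options included in the prompt, versus k-shot with answered demonstrations prepended.

```python
# Illustrative sketch of MMLU-style prompt construction.
# The question text and helper names are hypothetical, for illustration only.

CHOICES = ["A", "B", "C", "D"]

def format_question(question, options, answer=None):
    """Render one multiple-choice item; append the answer letter for few-shot demos."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, options)]
    # For the item under test, the prompt ends at "Answer:" and the model
    # is scored on which option letter it continues with.
    lines.append(f"Answer: {CHOICES[answer]}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(test_item, shots=()):
    """Zero-shot when `shots` is empty; k-shot when k answered demos are prepended."""
    blocks = [format_question(q, opts, ans) for q, opts, ans in shots]
    blocks.append(format_question(*test_item))
    return "\n\n".join(blocks)
```

In the zero-shot-with-options variant, `build_prompt` is called with no shots, so the model only ever sees the options listed in the prompt; in the 5-shot variant, five answered examples precede the test question, which can favor models better at in-context learning.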