Questions about evaluations such as MMLU
ftgreat commented
Thank you for sharing.
Common benchmarks such as MMLU are typically run in a 5-shot setting to measure a model's in-context learning capabilities.
Can you explain why your MMLU evaluation uses a zero-shot plus option-content approach instead?
According to your blog, in this setup your model's MMLU scores are higher than those of QWen1.5B and the Phi models, whereas in 5-shot evaluations the conclusion is the opposite. Is this result reasonable? Thank you.