TIGER-AI-Lab/MMLU-Pro

Support for standard deviation

Closed this issue · 1 comment

Hi!

I'm really liking this benchmark and am using it for my tests. However, I'm noticing that even with the temperature set to 0.0, many inference engines are not fully deterministic.

Would it be possible to add a standard deviation to the output to further improve confidence in the results?
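To illustrate what I mean, here's a minimal sketch of the kind of aggregation I'd find useful (the scores below are made-up accuracies from repeated runs of the same model and config, not real results):

```python
import statistics

# Hypothetical accuracies from five repeated runs of the same
# model/config (made-up numbers, purely for illustration).
scores = [0.7012, 0.7034, 0.6998, 0.7021, 0.7040]

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
print(f"accuracy: {mean:.4f} ± {stdev:.4f} over {len(scores)} runs")
```

Reporting something like `mean ± stdev` over a few runs would make it easier to tell genuine model differences from run-to-run noise.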

Hi, thanks for your feedback on the benchmark. Are you referring to differences in results obtained on the same hardware using various inference engines like vllm, lmdeploy, and direct inference? Or are you referring to differences in results when running our `evaluate_from_local.py` script on different hardware devices?