Support for standard deviation
Closed this issue · 1 comments
RodriMora commented
Hi!
I'm really liking this benchmark and using it for my tests. But I'm noticing that even with temperature set to 0.0, many inference engines are not fully deterministic.
Would it be possible to add a standard deviation to the output to further improve the confidence in the results?
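For reference, here is a minimal sketch of the kind of aggregation I mean: run the evaluation several times and report mean ± standard deviation of the accuracy. The accuracy values below are made-up placeholders, not real results.

```python
import statistics

def summarize_runs(accuracies):
    """Return mean and sample standard deviation over repeated run accuracies."""
    mean = statistics.mean(accuracies)
    # stdev needs at least two samples; fall back to 0.0 for a single run
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# Placeholder accuracies from, say, 5 repeated runs at temperature 0.0
accuracies = [0.712, 0.709, 0.715, 0.711, 0.713]
mean, std = summarize_runs(accuracies)
print(f"accuracy = {mean:.3f} ± {std:.3f}")
```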
Wyyyb commented
Hi, thanks for your feedback on the benchmark. Are you referring to differences in results obtained from the same hardware using various inference engines like vllm, lmdeploy, and direct inference? Or are you referring to differences in results when running our 'evaluate_from_local.py' script on different hardware devices?