huggingface/evaluate

[FR] Confidence intervals for metrics

NightMachinery opened this issue

It seems that currently simple metrics such as

evaluate.load("accuracy")

do not compute a confidence interval. This could easily be added by computing the mean and the standard deviation of the per-sample scores, and then dividing the standard deviation by the square root of the sample count (which gives the standard error of the mean estimate). (See, e.g., here.)
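As a rough sketch, here is how this could be done outside the library today; the accuracy_with_ci helper and the 95% z-value of 1.96 are my own illustrative choices, not part of the evaluate API:

# Normal-approximation confidence interval for accuracy, computed
# from the per-sample correctness indicators.
import numpy as np
import evaluate

def accuracy_with_ci(predictions, references, z=1.96):
    correct = (np.asarray(predictions) == np.asarray(references)).astype(float)
    mean = correct.mean()
    # Standard deviation of the per-sample scores.
    std = correct.std(ddof=1)
    # Standard error of the mean: std divided by the square root of n.
    sem = std / np.sqrt(len(correct))
    return {
        "accuracy": mean,
        "std": std,
        "sem": sem,
        "ci_low": mean - z * sem,
        "ci_high": mean + z * sem,
    }

accuracy = evaluate.load("accuracy")
preds, refs = [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]
print(accuracy.compute(predictions=preds, references=refs))  # accuracy only
print(accuracy_with_ci(preds, refs))                         # accuracy + CI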

Even just returning the variance (or standard deviation) would be enough; users could then do their own computations with those.