Reported pass@k silently wrong for n<k
Closed this issue · 1 comment
pass@k = 1 should be evidence that, in k generations by this model, at least one is very likely to pass the test.
However, the estimator returns 1 even when there are 0 passes among 99 tries if k = 100. Nothing in the callers prevents using too small an n; in fact, someone in a hurry is quite likely to use a small n (as I did in the original issue, oops).
Note, in contrast, how huggingface/evaluate handles the n < k case correctly: if it occurs for any result, pass@k for that k is elided from the output dictionary entirely.
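For reference, here is a minimal sketch of the standard unbiased pass@k estimator with an explicit guard for n < k. The function name and the choice to return `None` for the undefined case are illustrative, not the actual MultiPL-E or huggingface/evaluate code:

```python
import math
from typing import Optional

def pass_at_k(n: int, c: int, k: int) -> Optional[float]:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total completions sampled, c: completions that passed, k: budget.
    """
    if n < k:
        # The common shortcut `if n - c < k: return 1.0` silently returns
        # 1.0 here (e.g. n=99, c=0, k=100). Refuse to estimate instead.
        return None
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With n = 99, c = 0, k = 100 this returns `None` instead of the misleading 1.0 described above.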
Originally posted by @daniel-vainsencher in #31 (comment)
This is a partial solution to this problem.
The script to calculate pass@k now prints the minimum and maximum number of completions per row:
https://github.com/nuprl/MultiPL-E/blob/dev/pass_k.py#L53
For the informed user, MinCompletions < k signals that the pass@k number in that row is unreliable.
The gold standard is MinCompletions == MaxCompletions == 200.
But, when operating at scale, it helps to look at intermediate results.
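A minimal sketch of such a check (the function and field names are hypothetical, not the actual pass_k.py code): given the number of completions per problem in a row, report MinCompletions and MaxCompletions and flag whether pass@k is reliable for a given k.

```python
def completion_stats(completions_per_problem, k):
    """Summarize completion counts for one results row.

    completions_per_problem: list of n values, one per problem.
    Returns min/max counts and whether pass@k is reliable (min >= k).
    """
    lo = min(completions_per_problem)
    hi = max(completions_per_problem)
    return {
        "MinCompletions": lo,
        "MaxCompletions": hi,
        "reliable_for_k": lo >= k,  # False means the row's pass@k is suspect
    }
```

This mirrors the printed MinCompletions/MaxCompletions columns: a quick scan of intermediate results shows which rows have enough samples for the k being reported.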