nuprl/MultiPL-E

Reported pass@k silently wrong for n<k

Issue closed · 1 comment

pass@k = 1 should be evidence that, in k generations by this model, at least one is very likely to pass the test.

However, the estimator as defined returns 1 even when there are 0 passes among 99 tries if k=100. Nothing in the callers prevents using too small an n; in fact, someone in a hurry is quite likely to use a small n (as I did in the original issue, oops).
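To make the failure mode concrete, here is a minimal sketch. It assumes the estimator follows the common unbiased pass@k form, 1 - C(n-c, k) / C(n, k), with an early return of 1.0 when n - c < k; the exact code in this repository may differ. `guarded_estimator` is a hypothetical wrapper, not an existing function.

```python
from math import comb
from typing import Optional


def estimator(n: int, c: int, k: int) -> float:
    """Assumed pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: completions sampled, c: completions that passed, k: budget.
    """
    if n - c < k:
        # Intended for "too few failures to fill k slots", but it also
        # fires whenever n < k, regardless of how many completions passed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def guarded_estimator(n: int, c: int, k: int) -> Optional[float]:
    """Hypothetical guard: refuse to report pass@k when n < k."""
    if n < k:
        return None
    return estimator(n, c, k)


# 0 passes among 99 tries, k = 100: reported as a perfect score.
print(estimator(99, 0, 100))          # 1.0
# With enough samples, the same failure rate gives 0.
print(estimator(100, 0, 100))         # 0.0
# The guard surfaces the problem instead of returning a number.
print(guarded_estimator(99, 0, 100))  # None
```

Returning `None` is only one option; raising an error or logging a warning would serve the same purpose.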

Note, in contrast, how huggingface/evaluate handles the n<k case correctly: if it occurs for any result, pass@k for that k is omitted from the returned dictionary.
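A sketch of that omission behaviour, as described above, might look like the following. This is not the code from huggingface/evaluate; `pass_at_k_report` and its `(n, c)` input format are hypothetical, and the estimator is the same assumed form as in the standard unbiased pass@k definition.

```python
from math import comb


def estimator(n: int, c: int, k: int) -> float:
    """Assumed pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_report(results, ks=(1, 10, 100)):
    """Average pass@k over problems, skipping any k that some problem cannot support.

    results: list of (n, c) pairs, one per problem (hypothetical format).
    """
    report = {}
    for k in ks:
        # Only report pass@k if every problem has at least k completions.
        if results and all(n >= k for n, _ in results):
            scores = [estimator(n, c, k) for n, c in results]
            report[f"pass@{k}"] = sum(scores) / len(scores)
    return report


# Two problems with 20 completions each: pass@100 is left out entirely.
report = pass_at_k_report([(20, 5), (20, 0)])
print(sorted(report))  # ['pass@1', 'pass@10']
```

A missing key is harder to misread than a silently inflated 1.0.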

Originally posted by @daniel-vainsencher in #31 (comment)

This is a partial solution to this problem.

The script to calculate pass@k now prints the minimum and maximum number of completions per row:

https://github.com/nuprl/MultiPL-E/blob/dev/pass_k.py#L53

For the informed user, MinCompletions < k means that the number in that row is unreliable.

The gold standard is MinCompletions == MaxCompletions == 200.

But when operating at scale, it helps to look at intermediate results.
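The check a reader is expected to do by eye can be sketched as below. The names MinCompletions and MaxCompletions come from the comment above; the function `completion_range` and the `reliable` field are hypothetical and do not reproduce the actual output of pass_k.py.

```python
def completion_range(counts, k):
    """Summarize completions per problem for one row and flag n < k.

    counts: number of completions for each problem in the row.
    """
    lo, hi = min(counts), max(counts)
    return {
        "MinCompletions": lo,
        "MaxCompletions": hi,
        # The row's pass@k is only trustworthy if every problem has >= k completions.
        "reliable": lo >= k,
    }


# An intermediate run where one problem has only 37 completions so far.
print(completion_range([200, 200, 37], k=100))
# {'MinCompletions': 37, 'MaxCompletions': 200, 'reliable': False}
```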