Fix some easy paths to wrong evaluation results
daniel-vainsencher opened this issue · 5 comments
As a newcomer to BigCode, I wanted to use MultiPL-E to confirm pass@k performance for BigCode models before starting experiments. I loosely followed the tutorial/advice from Slack and ran into a few surprises in the pass@k values reported.
What I ran (with relevant outputs only):
# setup
$ git clone https://github.com/nuprl/MultiPL-E
$ cd MultiPL-E/
$ mkdir tutorial
$ python3 -m inference --model-name inference.bigcode_dedupaltcomments --root-dataset humaneval --lang py --temperature 0.2 --batch-size 20 --completion-limit 20 --output-dir-prefix tutorial
$ podman run --rm --network none -v ./tutorial:/tutorial:rw multipl-e-eval --dir /tutorial --output-dir /tutorial --recursive
# surprises start here:
$ python3 src/single_experiment_pass_k.py ./tutorial/humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded/
Dataset,Pass@k,Estimate
,10,0.25
,100,1.00
$ python3 src/single_experiment_pass_k.py ./tutorial/humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded
Dataset,Pass@k,Estimate
humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded,1,0.18
Notice that the last two commands differ only in the trailing "/".
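A plausible cause for the empty Dataset field (an assumption on my part; I haven't traced the script) is that the dataset name is derived from the last path component, e.g. via os.path.basename, which returns an empty string when the argument ends in a slash. A minimal demonstration with a shortened, hypothetical path:

```python
import os
from pathlib import Path

# os.path.basename treats everything after the final "/" as the name,
# so a trailing slash yields an empty string.
print(os.path.basename("tutorial/humaneval-py-reworded"))   # humaneval-py-reworded
print(os.path.basename("tutorial/humaneval-py-reworded/"))  # (empty string)

# pathlib's Path.name strips the trailing slash first, avoiding the surprise.
print(Path("tutorial/humaneval-py-reworded/").name)         # humaneval-py-reworded
```

If the temperature is also parsed out of that name, an empty name would explain why the two invocations fall into different reporting branches.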
Surprise 1: we got a perfect pass@100 = 1 even though we ran only 20 generations per problem, and the reported pass@10 was 0.25, so the probability of failing at 100 is certainly not 0.
This behavior seems to be implied by the combinatorial definition of pass@k given in the estimator's comment. Regardless of anything else, I think that n < k should emit a NaN.

Surprise 2: in the first invocation we received reports of pass@10 and pass@100, though the intent (as shown in the second run) is to report only pass@1 when the temperature is 0.2 (per convention).
Surprise 3: the pass@k evaluation code itself attempts to enforce which pass@k values are reported at which temperatures, without either notifying me that this is happening or (better, in my opinion) letting me opt into that convention.
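For context, the unbiased estimator from the Codex/HumanEval code (which I believe MultiPL-E mirrors) computes pass@k = 1 − C(n−c, k)/C(n, k) and short-circuits to 1.0 whenever n − c < k; with n = 20 and k = 100 that condition always holds, which produces the spurious pass@100 = 1. A sketch with the NaN guard I would propose (the function name and guard are mine, not the project's):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    computed as a running product for numerical stability.

    n: total samples generated, c: samples that passed, k: budget.
    """
    if k > n:
        # Proposed guard: with fewer than k samples the estimate is
        # meaningless, so refuse rather than silently return 1.0.
        return float("nan")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - math.prod(1.0 - k / j for j in range(n - c + 1, n + 1))
```

Without the `k > n` guard, the upstream version falls through to the `n - c < k` branch and reports a perfect score, exactly as observed above.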
I believe this has been addressed!
Good to see. About surprise 1 (the definition of pass@k for n < k): I've since found the exact same code in the upstream evaluation harness, so perhaps I'll follow up there.