nuprl/MultiPL-E

Fix some easy paths to wrong evaluation results

daniel-vainsencher opened this issue · 5 comments

As a newcomer to BigCode, I wanted to use MultiPL-E to confirm pass@k performance for BigCode models before starting experiments. I loosely followed the tutorial and the advice from Slack, and ran into a few surprises in the pass@k values reported.

What I ran (with relevant outputs only):

# setup
$ git clone https://github.com/nuprl/MultiPL-E
$ cd MultiPL-E/
$ mkdir tutorial
$ python3 -m inference --model-name inference.bigcode_dedupaltcomments --root-dataset humaneval --lang py --temperature 0.2 --batch-size 20 --completion-limit 20 --output-dir-prefix tutorial
$ podman run --rm --network none -v ./tutorial:/tutorial:rw multipl-e-eval --dir /tutorial --output-dir /tutorial --recursive
# surprises start here:
$ python3 src/single_experiment_pass_k.py ./tutorial/humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded/
Dataset,Pass@k,Estimate
,10,0.25
,100,1.00
$ python3 src/single_experiment_pass_k.py ./tutorial/humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded
Dataset,Pass@k,Estimate
humaneval-py-bigcode_1B_080e3b87d19ace8aa4f72c30e5458cab820644dc_dedupaltcomments-0.2-reworded,1,0.18

Notice that the last two commands differ only in the trailing "/".
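The difference is consistent with the dataset name (and hence the temperature) being parsed from the directory basename, which a trailing slash empties out. This is an assumption about how the script works, but the underlying path behavior is easy to demonstrate:

```python
import os.path
from pathlib import Path

# Hypothetical results directory name (shortened from the real one above)
d = "tutorial/humaneval-py-model-0.2-reworded"

# os.path.basename keeps only what follows the last separator, so a
# trailing slash leaves an empty string — nothing to parse a dataset
# name or temperature from:
print(os.path.basename(d + "/"))  # empty string
print(os.path.basename(d))        # the full directory name

# pathlib's Path.name normalizes the trailing slash away, which would
# make the two invocations behave identically:
print(Path(d + "/").name)
```

If the script switched from `os.path.basename` to `Path(...).name` (or normalized the argument first), both invocations would report the same thing.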

Surprise 1: we got a perfect pass@100 = 1.0 even though we only ran 20 generations per problem. The reported pass@10 was 0.25, so the probability of failing at k = 100 is certainly not 0.

This behavior seems to be implied by the comment on the combinatorial definition of

def estimator(n: int, c: int, k: int) -> float:

Regardless of anything else, I think that n < k should emit a NaN.
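For reference, here is a sketch of that combinatorial estimator (the standard unbiased pass@k formula, 1 - C(n-c, k) / C(n, k)) with the NaN guard I'm proposing. The function name and the guard are mine; the `n - c < k` early return mirrors what the existing code does, which is exactly why k > n silently yields 1.0 today:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n: total generations per problem, c: how many passed, k: the k in pass@k.
    """
    if k > n:
        # Proposed fix: with fewer than k samples the estimate is
        # undefined, so refuse to extrapolate instead of returning 1.0.
        return float("nan")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With n = 20 samples, any problem with at least one pass has n - c < 100, so the unguarded estimator reports pass@100 = 1.0 across the board, which is the perfect score seen above.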

Surprise 2: the first invocation reported pass@10 and pass@100, although the intent (as the second run shows) is to report only pass@1 at temperature 0.2, per the usual convention.

Surprise 3: the pass@k evaluation code itself enforces which pass@k values are reported at which temperatures, without either notifying me that this is happening or (better, in my opinion) letting me opt into that convention.
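An opt-in, announced version of that convention might look like the following sketch. The function and the temperature-to-k mapping for 0.8 are my assumptions, not MultiPL-E's current code; only the "pass@1 at temperature 0.2" part comes from the convention described above:

```python
# Hypothetical mapping from sampling temperature to conventional k values
# (0.2 -> pass@1 per the convention above; 0.8 -> pass@10/100 is assumed).
CONVENTIONAL_KS = {0.2: [1], 0.8: [10, 100]}

def ks_to_report(temperature, requested_ks=None, follow_convention=False):
    """Return the list of k values to report for pass@k.

    Explicitly requested ks always win; the temperature convention is
    applied only when the caller opts in, and is announced when applied.
    """
    if requested_ks:
        return requested_ks
    if follow_convention and temperature in CONVENTIONAL_KS:
        ks = CONVENTIONAL_KS[temperature]
        print(f"Note: applying convention pass@{ks} for temperature {temperature}")
        return ks
    return [1]  # default: just pass@1, no silent convention
```

This keeps the convention available without it being enforced behind the user's back.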

I believe this has been addressed!

Good to see. About surprise 1 (the definition of pass@k for n < k): I've since found the exact same code in the upstream evaluation harness, so perhaps I'll follow up there.