nuprl/MultiPL-E

Add HumanEval+ tests

Closed this issue · 15 comments

Randl commented

In "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation", Liu et al. introduced additional auto-generated tests to HumanEval, which reduced pass rates significantly for some modes (e.g., 32.2->27.2 for CodeGen 16B or 88.4->76.2 for GPT-4). I think it will be useful to add these tests to MultiPL-E.
Their code is available at https://github.com/evalplus/evalplus

arjunguha commented

This may be a good first contribution to MultiPL-E that is also very significant. :)

Randl commented

I see what you did there 🙃
Do you mind briefly explaining what that would amount to, or, alternatively, referring me to the relevant docs?

arjunguha commented

I think it may be pretty easy. Here is HumanEval:

https://github.com/nuprl/MultiPL-E/tree/main/datasets/originals-with-cleaned-doctests

All you have to do is duplicate that directory and paste in the new tests (Don't care how -- automatic or manual doesn't matter)

Once that's done, you "prepare prompts" for multiple languages with this script:

https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/all_prepare_prompts.py#L55

You use your new directory as the originals argument.

That creates these JSON files:

https://github.com/nuprl/MultiPL-E/tree/main/prompts

That's all that goes on the Hub.
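Roughly, the whole thing amounts to something like the sketch below. The originals-with-plus-tests directory name is made up, and the exact command line for all_prepare_prompts.py (in particular how the originals argument is spelled) should be taken from the script itself rather than from this sketch:

import shutil
import subprocess
from pathlib import Path

originals = Path("datasets/originals-with-cleaned-doctests")
plus = Path("datasets/originals-with-plus-tests")  # hypothetical name

# 1. Duplicate the HumanEval originals; the HumanEval+ tests then get
#    pasted over the test section of each copied problem file (not shown
#    here -- manual or scripted, it doesn't matter).
shutil.copytree(originals, plus, dirs_exist_ok=True)

# 2. Prepare prompts for all target languages, using the new directory as
#    the originals argument. The flag spelling below is a guess; check
#    dataset_builder/all_prepare_prompts.py for the real interface.
subprocess.run(
    ["python3", "dataset_builder/all_prepare_prompts.py",
     "--originals", str(plus)],
    check=True,
)

# 3. The generated JSON files under prompts/ are what ends up on the Hub.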

Randl commented

@arjunguha Is there a way to perform a floating point comparison for lists? I.e., something like

assert all(abs(x-y)<1e-6 for x,y in zip(input_list, output_list))

Also, where do I put the artifacts? Should I upload them to a separate dataset on HF? Some files are fairly large (on the order of 100 MB).

Randl commented

Also, some prompts have very large ints as results; do we want to keep these? Does translation support ints greater than (say) 2^64?

arjunguha commented

@arjunguha Is there a way to perform a floating point comparison for lists? I.e., something like

assert all(abs(x-y)<1e-6 for x,y in zip(input_list, output_list))


Is this a kind of assertion that's in HumanEval+? If so, it will require different infrastructure to translate.

Also, where do I put the artifacts? Should I upload them to a separate dataset on HF? Some files are fairly large (on the order of 100 MB).

Wow, really? Could you show me an example? This is not at all a problem with HumanEval, which is why the prompts are just in this repository.

Also, some prompts have very large ints as results; do we want to keep these? Does translation support ints greater than (say) 2^64?

Depends on the target language. For most targets I would say no.
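For a concrete sense of the gap: Python's own ints are unbounded, so the test itself runs fine on the Python side; the question is only whether the target language can even represent the literal.

# Python ints are arbitrary precision, so an assertion like this is fine:
assert 2**70 + 1 == 1180591620717411303425

# Most typed targets top out at a fixed-width integer, e.g. the largest
# unsigned 64-bit value:
print(2**64 - 1)  # 18446744073709551615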

Also, this one appears to fail due to the ellipsis?
https://github.com/nuprl/MultiPL-E/blob/main/datasets/originals-with-cleaned-doctests/HumanEval_148_bf.py#L3

There are a couple of problems that fail to translate to typed targets. This is one of them. It's easy to argue both ways: clearly the problem should use a list and not a tuple, but it's also clear that changing the tuple to a list would have been a far more significant change than the other tweaks we made.

Randl commented

Is this a kind of assertion that's in HumanEval+? If so, it will require different infrastructure to translate.

It is something that HumanEval+ does, but I think it should be done for some tests in HumanEval too. For example, HumanEval_21_rescale_to_unit uses equality to compare a list of floats, which may not work. HumanEval+ uses floating point comparison if the output is a float or a sequence of floats. They also add some new tests with floating point numbers to other problems, so this issue affects more problems in HumanEval+ than in HumanEval.
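To make the failure mode concrete, here is a toy illustration (not taken from an actual HumanEval test):

# Exact equality on floats can fail even when the answer is correct up to
# rounding: 0.1 + 0.2 evaluates to 0.30000000000000004.
out = [0.1 + 0.2, 1.0]
expected = [0.3, 1.0]
print(out == expected)  # False

# A tolerance-based check in the style HumanEval+ uses when the output is
# a float or a sequence of floats:
assert all(abs(x - y) < 1e-6 for x, y in zip(out, expected))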

Wow, really? Could you show me an example?

Ignoring the cases which use huge numbers for now, the problematic ones (the only ones larger than 10 MB; both are larger than 100 MB) are problems 15 and 130. The first one has output size linear in the input value, so a set of large inputs results in a large test file. The second one can potentially generate infinite lists, I guess, but in our case it just generates a number of very large ones.

Depends on the target language. For most targets I would say no.

I'll ignore these for now then.

arjunguha commented

I see, so the original had floating point comparisons too:

https://github.com/nuprl/MultiPL-E/blob/main/datasets/originals/HumanEval_21_rescale_to_unit.py

My sense is that doing floating point (1) well and (2) in a way that is actually portable across 19 languages is a lot of effort that may not be worthwhile.

After all, it's not that the underlying benchmarks are perfect themselves. This is my favorite:

import datasets

d = datasets.load_dataset("mbpp")
item = d["validation"].filter(lambda x: "polar" in x["text"])
print(item[0]["text"])
print("\n".join(item[0]["test_list"]))
Write a function to convert polar coordinates to rectangular coordinates.
assert polar_rect(3,4)==((5.0, 0.9272952180016122), (-2+2.4492935982947064e-16j))
assert polar_rect(4,7)==((8.06225774829855, 1.0516502125483738), (-2+2.4492935982947064e-16j))
assert polar_rect(15,17)==((22.67156809750927, 0.8478169733934057), (-2+2.4492935982947064e-16j))

Randl commented

My sense is that doing floating point (1) well and (2) in a way that is actually portable across 19 languages is a lot of effort that may not be worthwhile.

It doesn't feel like implementing abs(x-y)<z would be too hard, but I may be naive here. Do you think it's better to drop such tests altogether or add them as exact comparisons? It feels unfair to disqualify the whole task based on a wrong fp comparison if only a small share of the tests are fp.
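For what it's worth, a Python-side sketch of such a check is short; the real cost is reproducing it consistently in the generated test suites of all 19 target languages. This helper is purely illustrative, not part of MultiPL-E:

# Illustrative only: tolerance-aware equality for floats and (nested)
# lists/tuples, falling back to exact equality for everything else.
def approx_equal(a, b, tol=1e-6):
    if isinstance(a, float) or isinstance(b, float):
        return abs(a - b) < tol
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(
            approx_equal(x, y, tol) for x, y in zip(a, b)
        )
    return a == b

assert approx_equal([0.1 + 0.2, 1], [0.3, 1])
assert not approx_equal([0.1, 0.2], [0.1, 0.3])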

arjunguha commented

I don't want to discourage anyone from trying new things. But this is something I would not personally do.

Randl commented

That leaves open the question of whether we drop the tests that compare floating point values or use exact comparison for them.

arjunguha commented

I don't know. :)

Randl commented

Fair enough. I suggest dropping them altogether; that would possibly leave us with some false positives that HumanEval+ can catch.

I think I'll ignore the large tests for now and prepare an initial PR. The ignoring is just a simple parameter in the code, so we'll easily be able to change our minds later; once again, that may make the tests weaker than HumanEval+, but still stronger than the current ones.