Inconsistencies with the HumanEval dataset
HamedTaherkhani opened this issue · 3 comments
Comparing the original HumanEval dataset with the one in your repository reveals some inconsistencies. For instance, three instances (HumanEval_32, HumanEval_38, and HumanEval_50) are missing from your version (https://github.com/noahshinn/reflexion/blob/main/programming_runs/benchmarks/humaneval-py.jsonl). Additionally, some test cases have been modified, such as in HumanEval_2, HumanEval_4, HumanEval_33, HumanEval_53, and HumanEval_78.
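For reference, a minimal sketch that reproduces the comparison of task IDs. The file paths are placeholders, and the identifier key is `task_id` in the original HumanEval release; the key assumed for the repo's `humaneval-py.jsonl` (`name`) may need adjusting:

```python
import json
import re

def task_numbers(path, key):
    """Collect the numeric HumanEval task index from each record in a .jsonl file."""
    with open(path) as f:
        return {int(re.search(r"\d+", json.loads(line)[key]).group())
                for line in f if line.strip()}

# Placeholder paths; "task_id" is the key in the original HumanEval release,
# and "name" is assumed here for the repo's humaneval-py.jsonl.
original = task_numbers("HumanEval.jsonl", "task_id")
repo_copy = task_numbers("humaneval-py.jsonl", "name")

print(len(original), "vs", len(repo_copy))   # expected: 164 vs 161
print(sorted(original - repo_copy))          # expected: [32, 38, 50]
```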
Please resolve this issue.
Can you provide more context?
- The original HumanEval has 164 instances, but your version has 161 instances.
- Some of the test cases in your version of HumanEval have been modified. For example, in HumanEval_78 the original test cases are as follows:
```python
def check(candidate):
    assert candidate("AB") == 1, "First test error: " + str(candidate("AB"))
    assert candidate("1077E") == 2, "Second test error: " + str(candidate("1077E"))
    assert candidate("ABED1A33") == 4, "Third test error: " + str(candidate("ABED1A33"))
    assert candidate("2020") == 2, "Fourth test error: " + str(candidate("2020"))
    assert candidate("123456789ABCDEF0") == 6, "Fifth test error: " + str(candidate("123456789ABCDEF0"))
    assert candidate("112233445566778899AABBCCDDEEFF00") == 12, "Sixth test error: " + str(candidate("112233445566778899AABBCCDDEEFF00"))
    assert candidate([]) == 0
```
However, the test cases in your version of the HumanEval dataset for this instance are as follows:
```python
def check(candidate):
    assert candidate('AB') == 1
    assert candidate('1077E') == 2
    assert candidate('ABED1A33') == 4
    assert candidate('2020') == 2
    assert candidate('123456789ABCDEF0') == 6
    assert candidate('112233445566778899AABBCCDDEEFF00') == 12

def test_check():
    check(hex_key)

test_check()
```
The statement `assert candidate([]) == 0` has been removed from the test cases.
I want to understand why you made such changes to the dataset.
We used the MultiPL-E benchmark, which includes 161 tasks; we also use MultiPL-E for our Rust experiments.
The HumanEval dataset is not clean, so transformations are required for a sound evaluation. MultiPL-E makes the following adjustments to the Python dataset:
> Of the 164 original HumanEval benchmarks: (1) we exclude 3 benchmarks that have Python helper functions in their prompt; (2) we modify 2 benchmarks to use unit tests instead of randomized testing; and (3) for certain typed languages, we fail to compile up to 5 benchmarks with untranslatable types. These changes do not lead to significantly different results for Python.
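As a rough illustration of adjustment (1), the excluded benchmarks can be identified directly from the original dataset: a HumanEval prompt normally defines a single function, so a second `def` signals a Python helper function embedded in the prompt. A minimal sketch (the path is a placeholder):

```python
import json

def has_helper_function(prompt: str) -> bool:
    # Heuristic: a HumanEval prompt normally defines exactly one function,
    # so a second `def` indicates a helper function in the prompt.
    return prompt.count("def ") > 1

# Placeholder path to the original OpenAI HumanEval release.
with open("HumanEval.jsonl") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print([t["task_id"] for t in tasks if has_helper_function(t["prompt"])])
# Expected, per the discussion above: HumanEval/32, HumanEval/38, HumanEval/50
```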