noahshinn/reflexion

Inconsistencies with the HumanEval dataset

HamedTaherkhani opened this issue · 3 comments

Comparing the original HumanEval dataset with the one in your repository reveals some inconsistencies. For instance, three instances (HumanEval_32, HumanEval_38, and HumanEval_50) are missing from your version (https://github.com/noahshinn/reflexion/blob/main/programming_runs/benchmarks/humaneval-py.jsonl). Additionally, some test cases have been modified, such as in HumanEval_2, HumanEval_4, HumanEval_33, HumanEval_53, and HumanEval_78.
Please resolve this issue.
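
For anyone who wants to reproduce the comparison, here is a minimal sketch. The file paths are placeholders, and the regex-based scan (which normalizes "HumanEval/32" vs. "HumanEval_32" spellings) is an assumption made to avoid depending on either file's exact schema:

import re

def task_ids(path):
    # Collect HumanEval task identifiers from the raw jsonl lines,
    # matching both "HumanEval/32" and "HumanEval_32" spellings.
    ids = set()
    with open(path) as f:
        for line in f:
            for num in re.findall(r"HumanEval[_/](\d+)", line):
                ids.add(f"HumanEval_{num}")
    return ids

original = task_ids("HumanEval.jsonl")     # from openai/human-eval
repo = task_ids("humaneval-py.jsonl")      # from programming_runs/benchmarks in this repo

print(len(original), len(repo))  # should show 164 vs. 161
print(sorted(original - repo))   # should list HumanEval_32, HumanEval_38, HumanEval_50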

Can you provide more context?

  • The original HumanEval has 164 instances, but your version has 161.
  • Some of the test cases in your version of HumanEval have been modified. For example, in HumanEval_78 the original test cases are as follows:
def check(candidate):

    assert candidate("AB") == 1, "First test error: " + str(candidate("AB"))      
    assert candidate("1077E") == 2, "Second test error: " + str(candidate("1077E"))  
    assert candidate("ABED1A33") == 4, "Third test error: " + str(candidate("ABED1A33"))      
    assert candidate("2020") == 2, "Fourth test error: " + str(candidate("2020"))  
    assert candidate("123456789ABCDEF0") == 6, "Fifth test error: " + str(candidate("123456789ABCDEF0"))      
    assert candidate("112233445566778899AABBCCDDEEFF00") == 12, "Sixth test error: " + str(candidate("112233445566778899AABBCCDDEEFF00"))  
    assert candidate([]) == 0

However, the test cases in your version of the HumanEval dataset for this instance are as follows:

def check(candidate):
    assert candidate('AB') == 1
    assert candidate('1077E') == 2
    assert candidate('ABED1A33') == 4
    assert candidate('2020') == 2
    assert candidate('123456789ABCDEF0') == 6
    assert candidate('112233445566778899AABBCCDDEEFF00') == 12

def test_check():
    check(hex_key)

test_check()

The statement assert candidate([]) == 0 has been removed from the test cases.
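
For reference, a straightforward hex_key (assuming the canonical HumanEval_78 task of counting the hex digits that are prime: 2, 3, 5, 7, B, D) passes both variants of the check; only the removed empty-input assert distinguishes them:

def hex_key(num):
    # Count hexadecimal digits that are prime numbers: 2, 3, 5, 7, B (11), D (13).
    primes = {"2", "3", "5", "7", "B", "D"}
    return sum(1 for ch in num if ch in primes)

def check_original(candidate):
    assert candidate("AB") == 1
    assert candidate("1077E") == 2
    assert candidate("ABED1A33") == 4
    assert candidate("2020") == 2
    assert candidate("123456789ABCDEF0") == 6
    assert candidate("112233445566778899AABBCCDDEEFF00") == 12
    assert candidate([]) == 0   # the edge case dropped in the repo's version

check_original(hex_key)  # passes: iterating an empty list yields a count of 0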
I want to understand why you made such changes to the dataset.

We used the MultiPL-E benchmark, which includes 161 tasks; we also use MultiPL-E for our Rust experiments.
The HumanEval dataset is not clean, so some transformations are required for a sound evaluation. MultiPL-E makes the following adjustments to the Python dataset:

Of the 164 original HumanEval benchmarks: (1) we exclude 3 benchmarks that have Python helper functions in their prompt; (2) we modify 2 benchmarks to use unit tests instead of randomized testing; and (3) for certain typed languages, we fail to compile up to 5 benchmarks with untranslatable types. These changes do not lead to significantly different results for Python.
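
For anyone who wants to inspect the exact per-task adjustments, diffing the test fields of the two files works. A minimal sketch, assuming both are jsonl files exposing task_id and test fields (the key names in humaneval-py.jsonl may differ; adjust to its actual schema):

import difflib, json

def load_tests(path, id_key="task_id", test_key="test"):
    # Field names are assumptions; adjust to each file's actual schema.
    rows = {}
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            rows[str(row[id_key]).replace("/", "_")] = row[test_key]
    return rows

original = load_tests("HumanEval.jsonl")     # from openai/human-eval
repo = load_tests("humaneval-py.jsonl")      # from this repository

# Unified diff of the test code for one task, e.g. HumanEval_78.
print("\n".join(difflib.unified_diff(
    original["HumanEval_78"].splitlines(),
    repo["HumanEval_78"].splitlines(),
    fromfile="original", tofile="repo", lineterm="")))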