nuprl/MultiPL-E

Warning: Bash performance results artificially low

Closed this issue · 5 comments

Failure Case 1

I'm not sure if this is an expected failure case of the unit tests in bash. Here is an example, HumanEval_45_triangle_area:

#!/bin/bash
#
#
# $1 is an integer
# $2 is an integer
triangle_area() {
    echo "$1 * $2 / 2.0" | bc -l
}


candidate() {
    triangle_area "$@"
}

set -e
run_test() {
    [[ $(candidate "5" "3") = "7.5" ]]
    [[ $(candidate "2" "2") = "2.0" ]]
    [[ $(candidate "10" "8") = "40.0" ]]
}

run_test

If we print the output of $(candidate "5" "3"), it is "7.500000000", which differs from the expected "7.5", so the test fails. Maybe bc could be used to evaluate the numeric values of the strings instead of comparing them as strings?

Failure Case 2

HumanEval_42_incr_list

#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
    for e in $1; do
        echo $((e + 1))
    done
}


candidate() {
    incr_list "$@"
}

set -e
run_test() {
#    [[ $(candidate "") = "" ]]
    echo $(candidate "3 2 1")   # prints -> 4 3 2\n
#    [[ $(candidate "3 2 1") = "4 3 2" ]]
    echo $(candidate "5 2 5 2 3 3 9 0 123")  # prints -> 6 3 6 3 4 4 10 1 124\n
    [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124" ]]
}

run_test

The first test passes; the second and third tests fail, so I printed out the output of each case.

I tried adding the newline character \n to the end of the expected values, and that didn't work. My lack of Bash knowledge isn't giving me any idea how this might be fixed, but I don't think this should fail?
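For what it's worth, the failure here is likely a newline issue: `echo` inside the loop emits one number per line, so the captured output is "4\n3\n2", not "4 3 2". The unquoted `echo $(candidate ...)` debug lines mask this, because word splitting collapses the newlines into spaces before printing. A small sketch of the difference:

```shell
#!/bin/bash
# Sketch of why the test fails: the loop produces newline-separated
# numbers, but word splitting in the unquoted debug `echo` hides that.
incr_list() {
    for e in $1; do
        echo $((e + 1))
    done
}

out=$(incr_list "3 2 1")

printf '%q\n' "$out"   # quoted: shows the real value, $'4\n3\n2'
echo $out              # unquoted: word splitting prints "4 3 2"

[[ "$out" = "4 3 2" ]] || echo "string comparison fails"
```

So appending \n to the expected value can't help: the mismatch is in the separators between the numbers, not at the end of the string.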

Tagging @mgree

Regarding Case 1: I'm not even sure what the right thing to do here is! What you get will depend on what tools the generated script shells out to.

For example, here is another solution:

triangle_area() {
  python3 -c "print($1 * $2 / 2)"
}

This produces "7.5\n".
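To make the tool-dependence concrete, here is the same arithmetic run through three common tools, each with its own default output format (availability of bc, python3, and awk is assumed):

```shell
#!/bin/bash
# The same computation, three tools, three output formats.
echo "5 * 3 / 2.0" | bc -l           # -> 7.50000000000000000000
python3 -c "print(5 * 3 / 2)"         # -> 7.5
awk 'BEGIN { print 5 * 3 / 2 }'       # -> 7.5
```

A correct solution built on bc fails the string comparison, while an equally correct one built on python3 or awk passes it.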

About Case 2: here is my hand-written fix. I've edited both the solution and the tests:

#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
    for e in $1; do
        echo -n "$((e + 1)) "
    done
}


candidate() {
    incr_list "$@"
}

run_test() {
    [[ $(candidate "") = "" ]]
    echo $?
    [[ $(candidate "3 2 1") = "4 3 2 " ]]
    echo $?
    [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124 " ]]
    echo $?
}

run_test

This produces:

$ bash incrlist.sh 
0
0
0

I am not sure it is reasonable to prompt a model to produce this solution, and I also think it's worse than the model-generated one. I think exact-matching on Bash results is a losing proposition. What we should do instead is something fuzzier, but that will be tricky to automate in the MultiPL-E style.
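As one rough sketch of what "fuzzier" could mean: normalize whitespace before comparing, so newline-separated, trailing-space, and space-separated outputs all pass. The `norm` helper is invented here for illustration, and this still wouldn't handle the numeric-formatting differences of Case 1.

```shell
#!/bin/bash
# Sketch: whitespace-normalizing comparison. `norm` squeezes runs of
# whitespace to single spaces and trims the edges, so "4\n3\n2",
# "4 3 2 " and "4 3 2" all compare equal.
norm() {
    tr -s ' \t\n' ' ' <<<"$1" | sed -e 's/^ //' -e 's/ $//'
}

incr_list() {
    for e in $1; do
        echo $((e + 1))     # newline-separated, as the model wrote it
    done
}

set -e
[[ $(norm "$(incr_list "3 2 1")") = "$(norm "4 3 2")" ]]
[[ $(norm "$(incr_list "5 2 5 2 3 3 9 0 123")") = "$(norm "6 3 6 3 10 1 124" | head -c 0)$(norm "6 3 6 3 4 4 10 1 124")" ]]
echo "fuzzy tests passed"
```
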

Also, I have tons of solutions, from all sorts of models, where the model produces Python.

I see. Yeah, I don't see an obvious solution to these problems either; perhaps steering away from Bash might be the best option. Thanks!

I'm going to leave this issue open. It's a warning about interpreting the bash results.