Warning: Bash performance results artificially low
Failure Case 1
I'm not sure whether this is an expected failure case for the unit tests in Bash. Here is an example, HumanEval_45_triangle_area:
#!/bin/bash
#
#
# $1 is an integer
# $2 is an integer
triangle_area() {
  echo "$1 * $2 / 2.0" | bc -l
}
candidate() {
  triangle_area "$@"
}
set -e
run_test() {
  [[ $(candidate "5" "3") = "7.5" ]]
  [[ $(candidate "2" "2") = "2.0" ]]
  [[ $(candidate "10" "8") = "40.0" ]]
}
run_test
If we print the output of $(candidate "5" "3"), it is "7.500000000", which differs from the expected "7.5", so the test fails. Maybe the tests could use something like bc to evaluate the numeric value of the strings instead of comparing them as strings?
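One way to read the suggestion above is to compare the two outputs as numbers rather than as strings. The sketch below is only an illustration and not part of the benchmark: the helper name `approx_eq` and its default tolerance are hypothetical, and it uses awk rather than bc, on the assumption that awk is available wherever the tests run.

```shell
# Hypothetical helper (not part of the benchmark): succeeds when two decimal
# strings are within a small tolerance of each other.
# $1 is the actual output, $2 is the expected value, $3 is an optional tolerance.
approx_eq() {
  awk -v a="$1" -v b="$2" -v eps="${3:-1e-9}" 'BEGIN {
    d = a - b                 # subtraction forces numeric interpretation
    if (d < 0) d = -d         # absolute difference
    exit !(d <= eps + 0)      # exit status 0 when close enough, 1 otherwise
  }'
}

# The string comparison fails, the numeric one does not.
[ "7.500000000" = "7.5" ] || echo "string comparison: different"
approx_eq "7.500000000" "7.5" && echo "numeric comparison: equal"
```

With a helper like this, the first test could be written as `approx_eq "$(candidate "5" "3")" "7.5"`, though that only helps for problems whose answer is a single number.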
Failure Case 2
HumanEval_42_incr_list
#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
  for e in $1; do
    echo $((e + 1))
  done
}
candidate() {
  incr_list "$@"
}
set -e
run_test() {
  # [[ $(candidate "") = "" ]]
  echo $(candidate "3 2 1") # prints -> 4 3 2\n
  # [[ $(candidate "3 2 1") = "4 3 2" ]]
  echo $(candidate "5 2 5 2 3 3 9 0 123") # prints -> 6 3 6 3 4 4 10 1 124\n
  [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124" ]]
}
run_test
The first test passes; the second and third tests fail, so I printed the output of each case. I tried adding the newline character \n to the end of the expected values, and that didn't work. I don't know Bash well enough to see how this might be fixed, but I don't think this should fail?
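A likely explanation, shown as a small sketch: the solution prints one number per line, so the command substitution returns the numbers separated by newlines (only trailing newlines are stripped). The unquoted `echo $(...)` then word-splits that value and prints it joined by spaces, which hides the difference, while `[[ ... ]]` does not word-split and so compares the newline-separated string against the space-separated expected value. The variable names `raw` and `flat` are only for illustration.

```shell
# Same solution as in the report above.
incr_list() {
  for e in $1; do
    echo $((e + 1))
  done
}

raw=$(incr_list "3 2 1")   # keeps the newlines between the numbers
flat=$(echo $raw)          # unquoted on purpose: word splitting, joined by spaces

# Brackets make the embedded newlines visible.
printf 'raw:  [%s]\n' "$raw"
printf 'flat: [%s]\n' "$flat"

[ "$raw" = "4 3 2" ] || echo "raw does not match the expected string"
[ "$flat" = "4 3 2" ] && echo "flat matches the expected string"
```

This is why appending \n to the expected value does not help: the mismatch is in the separators between the numbers, not at the end of the output.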
Tagging @mgree
Regarding Case 1: I'm not even sure what the right thing to do here is! What you get will depend on what tools the generated script will shell out to.
For example, here is another solution:
triangle_area() {
  python3 -c "print(5 * 3 / 2)"
}
This produces "7.5\n".
About Case 2: Here is my hand-written fix. I've edited both the solution and the tests:
#!/bin/bash
#
#
# $1 is a space-separated list
incr_list() {
  for e in $1; do
    echo -n "$((e + 1)) "
  done
}
candidate() {
  incr_list "$@"
}
run_test() {
  [[ $(candidate "") = "" ]]
  echo $?
  [[ $(candidate "3 2 1") = "4 3 2 " ]]
  echo $?
  [[ $(candidate "5 2 5 2 3 3 9 0 123") = "6 3 6 3 4 4 10 1 124 " ]]
  echo $?
}
run_test
This produces:
$ bash incrlist.sh
0
0
0
I am not sure it is reasonable to prompt a model to produce this solution. I also think it's worse than the model-generated solution. I think exact-matching on Bash results is a losing proposition. What we should do instead is something fuzzier, but that will be tricky to automate in the MultiPL-E style.
Also, I have tons of solutions, from all sorts of models, where the model produces Python.
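To make "something fuzzier" concrete, here is one possible shape for such a check. It is a sketch under stated assumptions, not a proposal for what MultiPL-E does or should do: the name `fuzzy_eq` is hypothetical, awk is assumed to be available, and the rule it applies (same whitespace-separated tokens, with numeric tokens compared by value) is just one of many possible definitions of a fuzzy match.

```shell
# Hypothetical fuzzy matcher (not part of MultiPL-E).
# Succeeds when $1 and $2 contain the same whitespace-separated tokens,
# comparing a pair of tokens by numeric value when both look like numbers.
fuzzy_eq() {
  # Values are passed through the environment so embedded newlines are safe.
  A="$1" B="$2" awk 'BEGIN {
    na = split(ENVIRON["A"], x)   # default splitting: runs of blanks, tabs, newlines
    nb = split(ENVIRON["B"], y)
    if (na != nb) exit 1
    num = "^[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)$"
    for (i = 1; i <= na; i++) {
      if (x[i] ~ num && y[i] ~ num) {
        if (x[i] + 0 != y[i] + 0) exit 1   # numeric comparison
      } else if (x[i] != y[i]) {
        exit 1                             # plain string comparison
      }
    }
    exit 0
  }'
}

fuzzy_eq "7.500000000" "7.5" && echo "case 1 style output accepted"
fuzzy_eq "$(printf '4\n3\n2\n')" "4 3 2" && echo "case 2 style output accepted"
```

A check like this would accept both failure cases from this issue, but it also discards information (for example, it cannot tell "2" from "2.0"), so whether it is appropriate depends on the problem.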
I see. Yeah, I don't see an obvious solution to these problems; perhaps steering away from Bash might be the best option. Thanks!
I'm going to leave this issue open. It's a warning about interpreting the bash results.