nuprl/MultiPL-E

C++ test float comparison

PootieT opened this issue · 3 comments

Example: HumanEval45_triangle_area

float triangle_area(long a, long b, long c) {
   // C++ program
}
int main() {
    auto candidate = triangle_area;
    assert(candidate((3), (4), (5)) == (6.0));
    assert(candidate((1), (2), (10)) == (float(-1)));
    assert(candidate((4), (8), (5)) == (8.18));
    assert(candidate((2), (2), (2)) == (1.73));
    assert(candidate((1), (2), (3)) == (float(-1)));
    assert(candidate((10), (5), (7)) == (16.25));
    assert(candidate((2), (6), (3)) == (float(-1)));
    assert(candidate((1), (1), (1)) == (0.43));
    assert(candidate((2), (2), (10)) == (float(-1)));
}

When comparing float outputs, the tests often fail (in this case, the third assertion is the one that failed for me), because C++ treats an unsuffixed literal like 8.18 as a double, which then does not compare equal to the program's float output.
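For illustration, here is a minimal standalone reproduction of that mismatch (the value 8.18f is only a stand-in for whatever a correct float implementation returns, not the actual model output):

#include <iostream>

int main() {
    float result = 8.18f;                         // stand-in for the candidate's float return value
    std::cout << (result == 8.18) << std::endl;   // prints 0: result is widened to double, and the
                                                  // double 8.18 is a different value than 8.18f
    std::cout << (result == 8.18f) << std::endl;  // prints 1: both sides are the float rounding of 8.18
}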

There is at least one other failure point in HumanEval_4_mean_absolute_deviation, and there could be many more.

float mean_absolute_deviation(std::vector<float> numbers) {
    // C++ program
}
int main() {
    auto candidate = mean_absolute_deviation;
    assert(candidate((std::vector<float>({(float)1.0, (float)2.0}))) == (0.5));
    assert(candidate((std::vector<float>({(float)1.0, (float)2.0, (float)3.0, (float)4.0}))) == (1.0));
    assert(candidate((std::vector<float>({(float)1.0, (float)2.0, (float)3.0, (float)4.0, (float)5.0}))) == (1.2));
}

Thanks for pointing this out! CC @abhijangda

Thank you for pointing this out.

Since double has higher precision than float, a literal that is not exactly representable in binary (e.g. 2.18) gets rounded differently as a double than as a float, so the two values compare unequal, while exactly representable values such as 1.0 and 0.0 still compare equal.

For example:

#include <iostream>

int main() {
    std::cout << "eq 1 " << (1.0 == 1.0f) << std::endl;
    std::cout << "eq 0 " << (0.0 == 0.0f) << std::endl;
    std::cout << "eq 2 " << (2.18 == 2.18f) << std::endl;
}

Gives the output:

eq 1 1
eq 0 1
eq 2 0

I will soon push the fix.
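For reference, one way such a fix could look is replacing the exact == assertions with an approximate comparison. This is only a sketch; the helper name assert_close and the 1e-4 tolerance are assumptions, not necessarily what the actual fix uses:

#include <cmath>
#include <cstdlib>
#include <iostream>

// Hypothetical helper: treat two values as equal if they differ by less than a small tolerance.
void assert_close(double actual, double expected, double tol = 1e-4) {
    if (std::fabs(actual - expected) > tol) {
        std::cerr << "expected " << expected << " but got " << actual << std::endl;
        std::abort();
    }
}

int main() {
    float candidate_output = 8.18f;        // stand-in for candidate((4), (8), (5))
    assert_close(candidate_output, 8.18);  // passes despite the float/double literal mismatch
}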

I believe these are the affected files for C++:

../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_0_has_close_elements.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_133_sum_squares.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_137_compare_one.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_151_double_the_difference.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_20_find_closest_elements.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_21_rescale_to_unit.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_22_filter_integers.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_2_truncate_number.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_45_triangle_area.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_47_median.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_4_mean_absolute_deviation.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_71_triangle_area.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_81_numerical_letter_grade.results.json.gz
../experiments/humaneval-cpp-bigcode_15b_800m-0.2-reworded/HumanEval_92_any_int.results.json.gz

pass@1 increases from 27.15% to 27.61% on this model.