Inconsistent Accuracy Evaluation for GPT-3.5-Turbo Solutions
VijayLingam95 opened this issue · 1 comment
Dear Authors,
I was trying to reproduce the numbers reported in your paper from the logs in "programming/root". Instead of using the script get_acc.py, I re-evaluated the generated solutions (stored under the key solution) against the unit tests (stored under the key test) using PyExecutor. While I was able to reproduce your results for the GPT-4 logs ('programming/root/test_mcts_hard_acc_full_4tst_temp_gpt4'), there is a discrepancy in the results for GPT-3.5-Turbo.

Specifically, the key is_solved is incorrectly set to True for some incorrect generated solutions and, in a few cases, set to False for correct generated solutions. Below is one instance illustrating the issue, taken from programming/root/test_mcts_hard_acc_full_4tst_temp/humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl.
# Prompt
from typing import List

def select_words(s: str, n: int) -> List[str]:
    """Given a string s and a natural number n, you have been tasked to implement
    a function that returns a list of all words from string s that contain exactly
    n consonants, in order these words appear in the string s.
    If the string s is empty then the function should return an empty list.
    Note: you may assume the input string contains only letters and spaces.
    Examples:
    >>> select_words('Mary had a little lamb', 4)
    ['little']
    >>> select_words('Mary had a little lamb', 3)
    ['Mary', 'lamb']
    >>> select_words('simple white space', 2)
    []
    >>> select_words('Hello world', 4)
    ['world']
    >>> select_words('Uncle sam', 3)
    ['Uncle']
    """
# Generated solution
from typing import List
print('Hello world!')
# Test cases
def check(candidate):
    assert candidate('Mary had a little lamb', 4) == ['little']
    assert candidate('Mary had a little lamb', 3) == ['Mary', 'lamb']
    assert candidate('simple white space', 2) == []
    assert candidate('Hello world', 4) == ['world']
    assert candidate('Uncle sam', 3) == ['Uncle']
    assert candidate('', 4) == []
    assert candidate('a b c d e f', 1) == ['b', 'c', 'd', 'f']

def test_check():
    check(select_words)

test_check()
# is_solved attribute (incorrectly set to True)
is_solved: True
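
For reference, this is roughly how I re-checked each record. It is a minimal sketch: the exec-based runner below is a simplification I wrote for illustration (no timeouts or sandboxing), not PyExecutor itself.

# Re-check sketch (illustrative helper, not part of the repo)
def runs_clean(solution: str, test: str) -> bool:
    """Return True only if the generated solution passes every assert in its test block."""
    program = solution + "\n" + test
    try:
        exec(program, {"__name__": "__main__"})
        return True
    except Exception:
        return False

For the record above, runs_clean(solution, test) returns False (select_words is never defined, so test_check() raises a NameError), yet is_solved is stored as True.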
To further quantify:

Folder: programming/root/test_mcts_hard_acc_full_4tst_temp_2
- humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 2 incorrect solutions flagged is_solved: True.
- mbpp-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 16 incorrect solutions flagged is_solved: True, 18 correct solutions flagged is_solved: False.

Folder: programming/root/test_mcts_hard_acc_full_4tst_temp
- humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 10 incorrect solutions flagged is_solved: True, 5 correct solutions flagged is_solved: False.
- mbpp-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 5 incorrect solutions flagged is_solved: True, 1 correct solution flagged is_solved: False.
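
These counts came from a tally along the following lines (again a sketch: it reuses the runs_clean helper from the sketch above and only relies on the solution, test, and is_solved keys present in the logs).

# Tally mismatches between the stored is_solved flags and actual test outcomes
import json

def tally(jsonl_path: str):
    false_positives = 0  # incorrect solutions flagged is_solved: True
    false_negatives = 0  # correct solutions flagged is_solved: False
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            passed = runs_clean(rec["solution"], rec["test"])
            if rec["is_solved"] and not passed:
                false_positives += 1
            elif not rec["is_solved"] and passed:
                false_negatives += 1
    return false_positives, false_negatives

print(tally("programming/root/test_mcts_hard_acc_full_4tst_temp/humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl"))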
Could this be due to a bug that leads to over- or under-estimation of the real accuracy?
Thanks for reaching out. I commented on the other issue; the bug that led to this has been fixed. The GPT-4 results are correct and were evaluated with the fix.