lapisrocks/LanguageAgentTreeSearch

Inconsistent Accuracy Evaluation for GPT-3.5-Turbo Solutions

VijayLingam95 opened this issue · 1 comment

Dear Authors,

I was trying to reproduce the numbers reported in your paper from the logs in "programming/root". Instead of using the script get_acc.py, I re-evaluated each generated solution (stored under the key solution) against its unit tests (stored under the key test) using PyExecutor. While I was able to reproduce your results from the GPT-4 logs ('programming/root/test_mcts_hard_acc_full_4tst_temp_gpt4'), there is a discrepancy in the GPT-3.5-Turbo results.
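For concreteness, here is a simplified, self-contained version of that re-check (my own script, not the repo's get_acc.py): it loads a .jsonl log, executes each record's solution followed by its test block, and compares the outcome with the stored is_solved flag. Unlike the PyExecutor-based evaluation, it runs the code in-process and without a timeout, so treat it as a sketch; the path is one of the logs discussed below.

# Simplified re-check sketch
import json

def recheck(path: str) -> None:
    mismatches = 0
    with open(path) as f:
        for i, line in enumerate(f):
            if not line.strip():
                continue
            record = json.loads(line)
            # The test block ends with a call like test_check(), which raises
            # AssertionError (or NameError, as in the example below) when the
            # solution is wrong.
            program = record["solution"] + "\n" + record["test"]
            try:
                exec(program, {})
                passed = True
            except Exception:
                passed = False
            if passed != record["is_solved"]:
                mismatches += 1
                print(f"record {i}: is_solved={record['is_solved']}, re-run passed={passed}")
    print(f"{mismatches} mismatching records in {path}")

recheck("programming/root/test_mcts_hard_acc_full_4tst_temp/"
        "humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl")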

Specifically, the key is_solved is incorrectly set to True for some incorrect generated solutions and, in a few cases, set to False for correct ones. Below is one instance illustrating the issue, from programming/root/test_mcts_hard_acc_full_4tst_temp/humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl.

# Prompt
from typing import List

def select_words(s: str, n: int) -> List[str]:
    """Given a string s and a natural number n, you have been tasked to implement 
    a function that returns a list of all words from string s that contain exactly 
    n consonants, in order these words appear in the string s.
    If the string s is empty then the function should return an empty list.
    Note: you may assume the input string contains only letters and spaces.
    Examples:
    >>> select_words('Mary had a little lamb', 4)
    ['little']
    >>> select_words('Mary had a little lamb', 3)
    ['Mary', 'lamb']
    >>> select_words('simple white space', 2)
    []
    >>> select_words('Hello world', 4)
    ['world']
    >>> select_words('Uncle sam', 3)
    ['Uncle']
    """

# Generated solution
from typing import List

print('Hello world!')

# Test cases
def check(candidate):
    assert candidate('Mary had a little lamb', 4) == ['little']
    assert candidate('Mary had a little lamb', 3) == ['Mary', 'lamb']
    assert candidate('simple white space', 2) == []
    assert candidate('Hello world', 4) == ['world']
    assert candidate('Uncle sam', 3) == ['Uncle']
    assert candidate('', 4) == []
    assert candidate('a b c d e f', 1) == ['b', 'c', 'd', 'f']

def test_check():
    check(select_words)

test_check()

# is_solved attribute (incorrectly set to True)
is_solved: True

Running this program raises a NameError at test_check(), since select_words is never defined, so none of the assertions can even run; the record should be is_solved: False.

To further quantify:
Folder: programming/root/test_mcts_hard_acc_full_4tst_temp_2

  1. humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 2 incorrect solutions flagged as is_solved: True.
  2. mbpp-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 16 incorrect solutions flagged as is_solved: True, 18 correct solutions flagged as is_solved: False.

Folder: programming/root/test_mcts_hard_acc_full_4tst_temp

  1. humaneval-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 10 incorrect solutions flagged as is_solved: True, 5 correct solutions flagged as is_solved: False.
  2. mbpp-py._mcts_8_gpt-3.5-turbo_pass_at_k_1_py.jsonl: 5 incorrect solutions flagged as is_solved: True, 1 correct solution flagged as is_solved: False.
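Taking these counts together (my own arithmetic, using shorthand folder names, and assuming any count not listed above is zero), the net effect on each log's solved count is:

# Net correction per log = (correct flagged False) - (incorrect flagged True)
counts = {
    "temp_2/humaneval": (2, 0),   # (incorrect flagged True, correct flagged False)
    "temp_2/mbpp":      (16, 18),
    "temp/humaneval":   (10, 5),
    "temp/mbpp":        (5, 1),
}
for log, (false_pos, false_neg) in counts.items():
    print(f"{log}: solved count off by {false_neg - false_pos:+d}")

Three of the four logs come out negative, i.e. the stored flags mostly overcount solved problems.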

Could this be due to a bug in the evaluation, leading to over- or under-estimation of the real accuracy?

Thanks for reaching out. I commented on the other issue; the bug that led to this has been fixed. The GPT-4 results are correct and were evaluated with the fix.