lapisrocks/LanguageAgentTreeSearch

Critical Bug: Incorrect Success Count in `programming/mcts.py` Severely Affects Reported Accuracy Metrics

VijayLingam95 opened this issue · 1 comment

Dear Authors,

I attempted to reproduce the results on the programming task from your paper. However, I encountered a critical issue in the programming/mcts.py file: the num_success counter is incremented unconditionally, without verifying whether the generated solution actually passed the ground-truth tests. For your reference, I have included the relevant code block below:

# if solved, exit early
if is_passing:
    is_passing = exe.evaluate(
        item["entry_point"], cur_func_impl, item["test"], timeout=10)
    is_solved = is_passing
    num_success += 1
    item["acc"] = round(num_success/(idx+1), 2)
    write_jsonl(log_path, [item], append=True)
    print(num_success)
    print_v(f'completed {idx+1}/{num_items}: acc = {round(num_success/(idx+1), 2)}')
    continue

Specifically, num_success += 1 should be replaced with num_success += int(is_passing). After running your code with max_iters=8 and number_of_tests=4 using the GPT-3.5-Turbo model, I found from the logs that 21 failing solutions were being counted toward the accuracy metric. After fixing this bug, the accuracy on HumanEval (161 problems, following the Reflexion setup; terminal output: completed 161/161: acc = 0.87) dropped from 86.95% to 73.91%.
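For reference, here is a minimal sketch of the same block with the proposed fix applied. It assumes the surrounding loop context from mcts.py (exe, item, cur_func_impl, idx, num_items, log_path); only the increment line changes:

# if solved, exit early
if is_passing:
    # Re-evaluate against the ground-truth tests before counting a success.
    is_passing = exe.evaluate(
        item["entry_point"], cur_func_impl, item["test"], timeout=10)
    is_solved = is_passing
    # Proposed fix: only increment when the ground-truth tests actually pass.
    num_success += int(is_passing)
    item["acc"] = round(num_success/(idx+1), 2)
    write_jsonl(log_path, [item], append=True)
    print(num_success)
    print_v(f'completed {idx+1}/{num_items}: acc = {round(num_success/(idx+1), 2)}')
    continue

As a sanity check on the numbers above: the original run reports roughly 140 successes out of 161 (≈ 87%); removing the 21 false positives leaves 119/161 ≈ 73.91%, consistent with the corrected figure.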

Additionally, from the commit history, it appears that the GPT-4 model was used to generate synthetic tests. Could you please confirm if this is the case?

Hi, thanks for pointing this out.

This is indeed a bug we noticed and fixed for GPT-4 but did not fix for GPT-3.5. We will reevaluate the results for this model and update the paper and repository.

For this particular set of solutions, the synthetic tests were generated by the same GPT-3.5 model; the choice of test-generation model was only made configurable in the code later on.