LATS uses real labels in HumanEval for reward calculation. This seems to be quite different from reflexion (it only uses internal tests). Is this correct?

Question

LATS appears to use the real HumanEval labels when calculating the reward. This seems quite different from Reflexion, which only uses internal tests. Is this correct?

geekan opened this issue a year ago · 3 comments

Answer 1 · 2023-11-10T22:15:25.000Z

Hi, LATS only uses internal tests during the search process. The real tests are only run once all the internal tests have passed. I can see why the way the reward is set might be confusing, but the reward_real term added to the overall reward is always 0 during search, so it doesn't affect the reward calculation or the search itself. It is only 1 when the real tests all pass, in which case the search also terminates.
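To make the description above concrete, here is a minimal sketch of that reward logic. It is not the repository's code: evaluate_candidate, the test callables, and the pass-fraction formula for the internal reward are all hypothetical and only illustrate the idea that reward_real stays 0 unless every internal test and every real test passes.

```python
from typing import Callable, List, Tuple

Test = Callable[[Callable], bool]  # hypothetical: a test takes a candidate function


def evaluate_candidate(
    candidate: Callable,
    internal_tests: List[Test],
    real_tests: List[Test],
) -> Tuple[float, bool]:
    """Return (reward, solved) for one candidate solution."""
    passed = [test(candidate) for test in internal_tests]
    # Assumption: the internal reward is the fraction of internal tests passed.
    reward_internal = sum(passed) / len(passed) if passed else 0.0

    reward_real = 0.0
    solved = False
    # The real tests are only consulted once every internal test passes.
    if passed and all(passed):
        solved = all(test(candidate) for test in real_tests)
        if solved:
            reward_real = 1.0  # only nonzero when the search is about to stop

    return reward_internal + reward_real, solved


# Usage: a candidate that passes the internal test but fails the real one.
internal = [lambda f: f(1) == 2]
real = [lambda f: f(2) == 4, lambda f: f(3) == 6]
wrong = lambda x: x + 1   # passes internal, fails real
right = lambda x: x * 2   # passes both
print(evaluate_candidate(wrong, internal, real))
print(evaluate_candidate(right, internal, real))
```

In this sketch a candidate that fails the real tests receives exactly the same reward as it would if the real tests had never been run, which is the sense in which reward_real does not influence the search.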

Answer 2 · 2023-11-11T09:04:47.000Z

Thank you very much for your answer. To clarify the problem and understand the details, I compared the implementations of LATS and Reflexion. I have two specific questions that I hope you can answer.

1. This seems to mark a passing run of the internal tests as a success. Should it be num_success += int(is_passing) instead?

2. Here, after the internal test cases pass, the real labels are used for evaluation and the reward is updated. Doesn't that change the reported metric to pass@iter?

LATS: https://github.com/andyz245/LanguageAgentTreeSearch/blob/5fa1ce7ed51a6f750905b2319ab0e653807c8aca/programming/mcts.py#L212
Reflexion (it exits after a single execution regardless of the result, so this is pass@1): https://github.com/andyz245/LanguageAgentTreeSearch/blob/5fa1ce7ed51a6f750905b2319ab0e653807c8aca/programming/reflexion.py#L93

Answer 3 · 2023-11-13T16:59:06.000Z

Hi, thanks for pointing this out. For the first question, is_passing would always be 1 in the loop, so the functionality is the same. The second question is indeed a bug; the break should be outside. I made some modifications to the code when preparing the release, but I will update it shortly.
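The effect of the break placement discussed in this thread can be illustrated with a small sketch. This is not the code from mcts.py: run_search, passes_internal, passes_real, and the break_outside flag are hypothetical names, and the loop is a simplified stand-in for the search. It assumes "outside" means the loop exits after the first real-test evaluation regardless of the result.

```python
from typing import Callable, Iterable, Tuple


def run_search(
    candidates: Iterable[int],
    passes_internal: Callable[[int], bool],
    passes_real: Callable[[int], bool],
    break_outside: bool = True,
) -> Tuple[bool, int]:
    """Return (solved, number of times the real tests were run)."""
    solved = False
    real_evals = 0
    for cand in candidates:
        if not passes_internal(cand):
            continue  # keep searching using internal feedback only
        real_evals += 1
        solved = passes_real(cand)
        if solved:
            break  # break "inside": stop only on a real-test success
        if break_outside:
            break  # break "outside": stop after the first real evaluation
    return solved, real_evals


# Hypothetical setup: every candidate passes internal tests, only 3 passes real.
cands = [1, 2, 3]
print(run_search(cands, lambda c: True, lambda c: c == 3, break_outside=True))
print(run_search(cands, lambda c: True, lambda c: c == 3, break_outside=False))
```

With the break outside, the real tests are run at most once per problem, which matches a pass@1 evaluation. With the break only inside the success branch, a failed real-test run lets the search continue and try again, so the real tests can be consulted several times, which is the pass@iter behavior raised in the question.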