Reproducing HotpotQA Results
Hi,
Thanks for the great work. Unfortunately, we are unable to reproduce your results for ReAct / Reflexion on HotpotQA.
For example, Table 5 of your paper reports a baseline accuracy of 0.26 for ReAct + gpt-3.5-turbo. We tried to reproduce this result using gpt-3.5-turbo on your dataset hotpot-qa-distractor-sample.joblib, but only reached a baseline (first-trial) accuracy of 0.09. You can see the detailed trajectories here: https://github.com/haoyb22/Reflexion_hotpotqa/blob/main/100_questions_5_trials.txt
We also find that our trajectories differ noticeably from yours. For example, the ReAct agent with gpt-3.5-turbo frequently emits more than one step at a time and does not follow the Thought/Action/Observation framework. These multi-step outputs break the Action parsing and halt the program, so we changed your agents.py slightly to let the run continue. You can see the changed agents.py here: https://github.com/haoyb22/Reflexion_hotpotqa/blob/main/agents.py
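For context, the spirit of our workaround is roughly the sketch below (simplified; the function and variable names are illustrative, and our actual change is in the linked agents.py): truncate the completion at the first model-generated Observation, and treat an unparsable action as a corrective observation instead of raising an exception.

```python
import re

def truncate_to_first_step(llm_output: str) -> str:
    """gpt-3.5-turbo often emits several Thought/Action/Observation
    steps in one completion; keep only the text before the first
    hallucinated Observation."""
    return re.split(r"\nObservation", llm_output, maxsplit=1)[0].strip()

def parse_action(action_str: str):
    """Parse 'Search[...]', 'Lookup[...]', or 'Finish[...]'.
    Return (action_type, argument), or None instead of raising,
    so one malformed step does not kill the whole trial."""
    match = re.match(r"^(Search|Lookup|Finish)\[(.*)\]$", action_str.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

# Inside the step loop: a None action becomes a corrective
# observation rather than an exception that stops the run.
raw = "Search[Colorado orogeny]\nObservation: the model hallucinated this"
parsed = parse_action(truncate_to_first_step(raw))
if parsed is None:
    observation = ("Invalid action. Valid actions are Search[<topic>], "
                   "Lookup[<keyword>], and Finish[<answer>].")
```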
Could you clarify how to reproduce the results you report?