noahshinn/reflexion

About Reflexion temperature (for HotpotQA)

pengjiao123 opened this issue · 5 comments

1. For the HotpotQA task, you used temperature = 0 to compare ReAct-only against ReAct + Reflexion.
Is that because this is a question-answering task, so temperature 0 is the natural choice?
Also, with temperature = 0, can we guarantee that each ReAct-only trial gives a definite, unchanged answer?

2. Have you compared the results when the temperature is not 0 (e.g., 0.7 or even 1)?
I suspect that with, say, 5 trials, ReAct-only might outperform ReAct + Reflexion.
What is your view, and have you run similar experiments?

Good point, this is a gray area that we considered. The motive for running many trials with temperature = 0 was to provide a baseline performance for the model with some configuration. I am curious to know if it is worthwhile to use a higher temperature or a different sampling strategy, but I have not run sufficient experiments to test this. In the experiments for the paper, we wanted to isolate the improvement effect of adding the reflections. Let me know if this answers your question, happy to answer further if needed.
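
(For concreteness, each trial boils down to a call like the sketch below. This is a minimal illustration assuming the OpenAI chat API; `run_trial`, the model name, and the prompt handling are placeholders, not the repo's exact code.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_trial(prompt: str) -> str:
    """One ReAct-style trial; temperature=0 keeps decoding (near-)deterministic,
    which is what makes repeated ReAct-only trials a stable baseline."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder, not necessarily the model from the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```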

Thank you for your reply.
1. We ran several experiments and found that when the temperature is not 0 (e.g., 0.7 or 1), ReAct-only can do better than ReAct + Reflexion.
We had assumed you used temperature = 0 because HotpotQA is a multi-hop comprehension task that belongs to knowledge reasoning; temperature = 0 is common practice for this type of problem and keeps the output relatively fixed.
We are not sure whether this understanding is correct.
2. With temperature = 0, the ReAct-only results are almost stable across multiple trials. This is probably because the model tends to return a fixed answer at temperature 0, so the output barely changes from trial to trial.
The Reflexion results, however, change dramatically after five trials, at which point they are much better than ReAct-only.
We believe that at temperature 0 the ReAct-only results should, in theory, vary very little across trials, whereas with repeated Reflexion trials and the introduction of the true labels, the results are bound to keep improving. Is that reasonable?
3. In this experiment, the evaluator uses exact match (EM) against the true labels. Is this acceptable? Would it be better to switch to a more capable LLM that has (very likely) never seen this dataset? The current EM check is equivalent to having the label known in advance, so across multiple trials Reflexion will intuitively have an advantage over ReAct-only. As we understand it, the EM signal is essentially the standard SQuAD-style comparison against the gold answer, roughly like the sketch below (our own illustration, not the repo's code).
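
```python
import re
import string

def normalize_answer(s: str) -> str:
    """Standard SQuAD-style normalization: lowercase, then drop punctuation,
    articles, and extra whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, ground_truth: str) -> bool:
    """The EM signal that decides whether another Reflexion trial is needed."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)
```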

Looking forward to your reply. Thanks!

  1. Did you evaluate Reflexion with a temperature of 0.7 in the action step? I'm curious to know whether a performance boost is observed. The ReAct run is a baseline, so it is important to look only at the relative improvement.
  2. This is correct.
  3. Yes, properly evaluating short-answer QA is a long-standing problem in any context. HotpotQA was the only task in which we used ground-truth labels. Connecting to my question in (1), I'd be interested to see the relative improvement.

1. We found that ReAct-only performs better than ReAct + Reflexion at a temperature of 0.7 or 1.
We also think the temperature = 0 experiment is unfair, because at that temperature ReAct-only is almost unchanged across trials. We believe other temperatures need to be used to verify the soundness of the experimental plan.
We suggest that the author run experiments at different temperatures to confirm the effectiveness of the results.
2. HotpotQA was the only task in which you used ground-truth labels, but we believe a model or a rule should be used for evaluation instead, e.g., GPT-3.5 for the actor and GPT-4 for the evaluator, or some heuristic rule if possible; a rough sketch follows below. We can understand the author's idea (the Reflexion framework is valuable), but in our opinion using the real labels in this experiment is unreasonable.
The idea is sound, but the experimental method may be flawed.
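
Something along these lines, for example (a rough sketch only; the prompt wording and model names are illustrative assumptions, not a tested setup):

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the proposed answer correct? Reply with only 'yes' or 'no'."
)

def llm_judge(question: str, answer: str) -> bool:
    """Evaluate with a stronger model so the actor never sees the gold label."""
    response = client.chat.completions.create(
        model="gpt-4",  # evaluator; the actor would run on a weaker model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```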

@pengjiao123 can you show some results for your first claim here?