Two questions about the article

Question

Closed this issue 6 months ago · 4 comments

Great article, thank you for sharing the code. I have two questions.

How do you control the Generator model to only generate one step at a time?
How does the Evaluator model apply the evaluation results of each step to the generation of the next step? Is it iteratively adding the selected step to the prompt?

Answer 1 · 2024-04-02T23:10:38.000Z

We only do reranking or RL, and both only require the reward model to score the entire solution. So what we did is simply let the generator model generate the full solution, and then let the evaluator generate the step-wise scores. You can check these two examples:
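
Separately from those examples, the reranking flow described in this answer can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repository's code: `split_steps`, `rerank`, and `toy_score_steps` are hypothetical names, steps are assumed to be separated by newlines, and taking the minimum step score as the solution score is one possible aggregation chosen for the sketch.

```python
from typing import Callable, List, Sequence


def split_steps(solution: str) -> List[str]:
    """Split a full solution into steps (assumption: one step per non-empty line)."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def rerank(
    solutions: Sequence[str],
    score_steps: Callable[[List[str]], List[float]],
) -> str:
    """Pick the best full solution using step-wise scores from an evaluator.

    The generator is assumed to have already produced complete solutions;
    the evaluator only scores them afterwards, step by step.
    """

    def solution_score(solution: str) -> float:
        scores = score_steps(split_steps(solution))
        # Assumption: a solution is only as good as its weakest step.
        return min(scores) if scores else float("-inf")

    return max(solutions, key=solution_score)


def toy_score_steps(steps: List[str]) -> List[float]:
    """Stand-in for the evaluator model; a real one would be a trained reward model."""
    return [0.1 if "wrong" in step else 0.9 for step in steps]


candidates = [
    "Step 1: 2 + 3 = 5\nStep 2: 5 * 2 = 10",
    "Step 1: 2 + 3 = 6 (wrong)\nStep 2: 6 * 2 = 12",
]
best = rerank(candidates, toy_score_steps)
print(best)
```

Nothing here feeds a score back into generation: every candidate is generated in full first and only then scored, which matches the "generate the full solution, then score the steps" description above.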

Answer 2 · 2024-04-03T02:12:27.000Z

That is to say, the PRM is not used at the inference stage, is that right?

Answer 3 · 2024-04-03T08:34:45.000Z

Answer 4 · 2024-04-08T01:22:15.000Z

Okay, thank you for your patient answer.