Edward-Sun/easy-to-hard

Two questions about the article


Great article, thank you for sharing the code. I have two questions.

  1. How do you control the Generator model to only generate one step at a time?
  2. How does the Evaluator model apply the evaluation results of each step to the generation of the next step? Is it iteratively adding the selected step to the prompt?

Hi @xiaolizh1

  1. We only do reranking or RL, and both only require the reward model to score the entire solution. So what we did is simple: let the generator model generate the full solution, and then let the evaluator generate the step-wise scores. You can check these two examples:

https://huggingface.co/ScalableMath/llemma-7b-sft-prm800k-level-1to3-hf

https://huggingface.co/ScalableMath/llemma-7b-oprm-prm800k-level-1to3-hf

  2. Same as above.
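
For readers who want to see the flow described above, here is a minimal sketch of reranking with step-wise scores: the generator produces complete solutions, the evaluator scores each step afterwards, and the step scores are collapsed into one score per solution. This is not the repository's code. The names `split_steps`, `aggregate`, `rerank`, and `toy_score_steps` are hypothetical, splitting steps on newlines is an assumption, and the aggregation choices (`min`, `prod`, `last`) are common options rather than the confirmed method here. The toy scorer stands in for a real evaluator model.

```python
from math import prod
from typing import Callable, List, Sequence


def split_steps(solution: str) -> List[str]:
    """Split a full solution into steps (assumption: one step per non-empty line)."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def aggregate(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse step-wise scores into a single solution-level score."""
    if not step_scores:
        return 0.0
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return prod(step_scores)
    if how == "last":
        return step_scores[-1]
    raise ValueError(f"unknown aggregation: {how}")


def rerank(
    solutions: Sequence[str],
    score_steps: Callable[[List[str]], List[float]],
    how: str = "min",
) -> str:
    """Return the full solution whose aggregated step scores are highest."""
    return max(solutions, key=lambda s: aggregate(score_steps(split_steps(s)), how))


def toy_score_steps(steps: List[str]) -> List[float]:
    """Hypothetical stand-in for the evaluator: penalize steps containing 'oops'."""
    return [0.2 if "oops" in step else 0.9 for step in steps]


# The generator is assumed to have already produced these full solutions.
candidates = [
    "x = 2 + 2\noops, x = 5\nanswer: 5",
    "x = 2 + 2\nx = 4\nanswer: 4",
]
best = rerank(candidates, toy_score_steps)
```

In this sketch the evaluator never feeds back into generation: scoring happens only after each candidate is complete, which matches the "generate the full solution, then score the steps" description above.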

So the PRM is not used in the inference stage, is that right?

Okay, thank you for your patient answer.