Two questions about the article
xiaolizh1 commented
Great article, thank you for sharing the code. I have two questions.
- How do you control the Generator model to only generate one step at a time?
- How does the Evaluator model apply the evaluation results of each step to the generation of the next step? Is it iteratively adding the selected step to the prompt?
Edward-Sun commented
Hi @xiaolizh1
- We only do reranking or RL, and both only require the reward model to score the entire solution. So what we did is simply let the generator model generate the full solution, and then let the evaluator generate the step-wise scores. You can check these two examples:
https://huggingface.co/ScalableMath/llemma-7b-sft-prm800k-level-1to3-hf
https://huggingface.co/ScalableMath/llemma-7b-oprm-prm800k-level-1to3-hf
- Same as above
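To make the flow above concrete, here is a minimal Python sketch of reranking with step-wise scores. It is not the repository's code. The names `split_steps`, `rerank`, and `toy_score_step` are hypothetical, the one-step-per-line split is an assumption, and taking the minimum step score as the solution score is just one possible aggregation, not necessarily the one used in the paper. In practice the full solutions would come from the generator checkpoint and the step scores from the evaluator checkpoint linked above.

```python
from typing import Callable, List, Sequence


def split_steps(solution: str) -> List[str]:
    """Split a full solution into steps (assumption: one step per non-empty line)."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def rerank(
    solutions: Sequence[str],
    score_step: Callable[[List[str], str], float],
    aggregate: Callable[[List[float]], float] = min,
) -> str:
    """Return the full solution whose aggregated step-wise score is highest.

    The generator is never asked for a single step: each candidate is a
    complete solution, and the evaluator only scores its steps afterwards.
    """
    if not solutions:
        raise ValueError("need at least one candidate solution")
    best_solution, best_score = solutions[0], float("-inf")
    for solution in solutions:
        steps = split_steps(solution)
        # Score each step given the steps that precede it.
        scores = [score_step(steps[:i], step) for i, step in enumerate(steps)]
        total = aggregate(scores) if scores else float("-inf")
        if total > best_score:
            best_solution, best_score = solution, total
    return best_solution


def toy_score_step(previous_steps: List[str], step: str) -> float:
    """Hypothetical stand-in for the evaluator model; a real PRM would go here."""
    return 0.1 if "wrong" in step else 0.9


candidates = [
    "x = 2 + 2\nwrong: x = 5",
    "x = 2 + 2\nx = 4",
]
best = rerank(candidates, toy_score_step)
```

With the toy scorer, the first candidate is penalised for its second step, so `best` is the second candidate. Swapping `aggregate` for a product or a mean changes how strongly a single bad step is punished.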
xiaolizh1 commented
In other words, the PRM is not used at inference time, is that right?
Edward-Sun commented
The link above is an OPRM or PRM: https://huggingface.co/ScalableMath/llemma-7b-oprm-prm800k-level-1to3-hf
xiaolizh1 commented
Okay, thank you for your patient answer.