model selection of PPO in Table 2
langhaobeijing opened this issue · 1 comments
langhaobeijing commented
Hi, thank you for your great work here!
After running the PPO script (examples/scripts/rlhf_ppo.sh) from your code, there are multiple checkpoints of the fine-tuned PPO model saved at different training steps.
I wonder how the checkpoint is selected for the PPO results in Table 2.
- based on the validation split (2k) or the evaluation data (805)?
- based on scores of the trained reward model or simulated preferences from p_sim^eval?
Thank you!
lxuechen commented
Thanks for your interest!
Our final Table 2 models were primarily selected based on p_sim^eval with the self-instruct eval data. For the runs on human preferences, we also performed human evaluation on some PPO checkpoints and on different k's for rerank, to make sure the final results weren't in the over-optimization regime (see Section 4 of our paper).