model selection of PPO in Table 2
langhaobeijing opened this issue · 1 comments
langhaobeijing commented
Hi, thank you for your great work here!
After running the PPO script (examples/scripts/rlhf_ppo.sh) from your code, there are multiple checkpoints of the fine-tuned PPO model saved at different training steps.
I wonder how the checkpoint is selected for the PPO results in Table 2.
- based on the validation split (2k) or the evaluation data (805)?
- based on scores of the trained reward model or simulated preferences from p_sim^eval?
Thank you!
lxuechen commented
Thanks for your interest!
Our final Table 2 models were primarily selected based on p_sim^eval with the self-instruct eval data. For the runs on human preferences, we also performed human evaluation on some PPO checkpoints and on different k's for rerank, to make sure the final results weren't in the over-optimization regime (see Section 4 of our paper).