RLHFlow/Online-RLHF

Cannot Reproduce the DPO Checkpoint

Closed this issue · 1 comment

Hi,

I tried to reproduce the training process from SFT to DPO. I ran the run_loop.sh script; the only change I made was setting initial_model="RLHFlow/LLaMA3-SFT". After 3 iterations, the final checkpoint of iteration 3 has an MT-Bench score of 7.95, which differs from the reported number. The initial SFT starting point has the same MT-Bench score as reported in the paper.

I did not modify any other settings in the run_loop.sh script. Please let me know if anything else is needed to reproduce the results.

Can you use bsz=128 instead? lr=5e-7 might be too large for bsz=32.
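For context, the effective batch size in multi-GPU training is typically the product of the GPU count, the per-device batch size, and the gradient-accumulation steps. A minimal sketch of that arithmetic follows; the variable names are illustrative assumptions, not the actual settings in run_loop.sh:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: how bsz=128 might be composed. These variable
# names are illustrative, NOT taken from run_loop.sh.
NUM_GPUS=8
PER_DEVICE_BSZ=4
GRAD_ACCUM=4

# Effective (global) batch size = GPUs * per-device * accumulation steps.
EFFECTIVE_BSZ=$((NUM_GPUS * PER_DEVICE_BSZ * GRAD_ACCUM))
echo "effective batch size: ${EFFECTIVE_BSZ}"

# The suggestion above pairs lr=5e-7 with bsz=128; with a smaller
# effective batch size such as 32, that learning rate may be too large.
LEARNING_RATE=5e-7
echo "learning rate: ${LEARNING_RATE}"
```

With 8 GPUs, a per-device batch size of 4 and 4 accumulation steps yield the suggested effective batch size of 128; adjusting any one factor changes the product accordingly.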