Cannot Reproduce the DPO Checkpoint
Closed this issue · 1 comments
gesy17 commented
Hi,
I tried to reproduce the training process from SFT to DPO. I ran the run_loop.sh script; the only change I made was setting initial_model="RLHFlow/LLaMA3-SFT". After 3 iterations, the final checkpoint of iteration 3 has an MT-Bench score of 7.95, which differs from the reported number. The initial SFT starting point has the same MT-Bench score as reported in the paper.
I did not modify any other settings in the run_loop.sh script. Please let me know if anything additional is needed to reproduce the result.
hendrydong commented
Can you use `bsz=128` instead? `lr=5e-7` might be too large for `bsz=32`.
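For anyone hitting the same gap: the suggestion above is to keep `lr=5e-7` but raise the effective batch size from 32 to 128. A minimal sketch of what that change might look like in a run script follows; the variable names here are hypothetical, so check run_loop.sh for the actual flag names your version uses.

```shell
# Hypothetical sketch -- flag names may differ in the actual run_loop.sh.
# Effective batch size = per-device batch * gradient accumulation * num GPUs,
# e.g. 4 * 4 * 8 = 128, matching the suggested bsz=128.
learning_rate=5e-7
per_device_train_batch_size=4
gradient_accumulation_steps=4
num_gpus=8
```

With a fixed learning rate, a smaller effective batch size yields noisier gradient updates, which can explain a lower final MT-Bench score even when the SFT starting point matches.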