RLHFlow/Online-RLHF

Cannot Reproduce the DPO Checkpoint

Closed this issue · 1 comment

Hi,

I tried to reproduce the training process from SFT to DPO. I ran the run_loop.sh script; the only change I made was setting initial_model="RLHFlow/LLaMA3-SFT". After 3 iterations, the final checkpoint of iteration 3 has an MT-Bench score of 7.95, which differs from the reported number. The initial SFT starting point has the same MT-Bench score as reported in the paper.

I did not modify any other settings in the run_loop.sh script. Please let me know if anything else is needed to reproduce the results.

Can you use bsz=128 instead? lr=5e-7 might be too large for bsz=32.
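For context, the effective batch size in multi-GPU training is typically the product of the GPU count, the per-device batch size, and the gradient-accumulation steps. A minimal sketch of that arithmetic follows; the variable names are illustrative assumptions, not the actual settings in run_loop.sh:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: how bsz=128 might be composed. These variable
# names are illustrative, NOT taken from run_loop.sh.
NUM_GPUS=8
PER_DEVICE_BSZ=4
GRAD_ACCUM=4

# Effective (global) batch size = GPUs * per-device * accumulation steps.
EFFECTIVE_BSZ=$((NUM_GPUS * PER_DEVICE_BSZ * GRAD_ACCUM))
echo "effective batch size: ${EFFECTIVE_BSZ}"

# The suggestion above pairs lr=5e-7 with bsz=128; with a smaller
# effective batch size such as 32, that learning rate may be too large.
LEARNING_RATE=5e-7
echo "learning rate: ${LEARNING_RATE}"
```

With 8 GPUs, a per-device batch size of 4 and 4 accumulation steps yield the suggested effective batch size of 128; adjusting any one factor changes the product accordingly.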