RLHF-V/RLAIF-V

About optimizer setting in Iterative Alignment

davidluciolu opened this issue · 2 comments

As you mentioned in the Implementation Details, you train the model for 4 epochs with a learning rate of 5e-7.

I want to know whether, in each round of DPO, the optimizer is newly created or loaded from the latest checkpoint.

For example, in epoch 2, is the learning rate reset to 5e-7, or does it continue from where it ended in epoch 1?

Hi @davidluciolu, thank you for your interest in our work!

As mentioned in our paper, we train the model for 4 iterations, and in each iteration the model is optimized for 4 epochs. We do not reset the learning rate between epochs: within a single iteration (spanning all 4 epochs), we use 5e-7 as the peak learning rate with a 5% warm-up ratio and a cosine scheduler. For each iteration, the optimizer is newly created.
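To make the schedule concrete, here is a minimal sketch of a cosine learning-rate schedule with linear warm-up that spans all epochs of one iteration. The peak LR (5e-7) and 5% warm-up ratio come from the reply above; `total_steps` and the `make_fresh_optimizer` name in the comment are illustrative placeholders, not the actual implementation.

```python
import math

# Values stated in the reply; total_steps is an illustrative placeholder.
PEAK_LR = 5e-7
WARMUP_RATIO = 0.05

def lr_at_step(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Cosine schedule with linear warm-up, spanning ALL epochs of one
    iteration -- the LR is NOT reset at epoch boundaries."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warm-up from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak learning rate down to 0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Per the reply, each of the 4 iterations re-creates the optimizer:
# for iteration in range(4):
#     optimizer = make_fresh_optimizer(model)   # new optimizer state
#     for step in range(total_steps):           # total_steps covers 4 epochs
#         lr = lr_at_step(step, total_steps)
#         ...
```

With this shape, the LR climbs during the first 5% of the iteration's steps, peaks at 5e-7, then decays smoothly across the remaining epochs; only a new iteration restarts the curve.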

If you have any other questions, we are happy to help!

Thanks for your reply!