How to construct new pairs for adding to the dataset
Thanks for your great work! I have some questions about how to add new pairs to the training set.
According to your paper:
Instead of fixing ... , we take
$\pi_t^1$ and $\pi_t^2$ as the best-of-8 policy and worst-of-8 policy induced by $\pi_t^{MLE}$
In my view, the normal process is as follows:
Suppose we have a DPO model after training in the
We use different temperatures (0.7 for
Then, we sample the best-of-8 for
Finally, we use the top-1 of this set and the generation result in
But how do I use best-of-8 and worst-of-8 to construct pairs like that?
With the DPO model obtained after training in iteration t, we use this policy to sample 8 responses per prompt, varying the temperature during generation (e.g., 4 samples at 0.7 and 4 at 1.0). Then we simply take the response with the highest reward as the accepted one and the response with the lowest reward as the rejected one.
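The selection step described above can be sketched roughly as follows. This is only an illustration, not the repository's actual code: `build_preference_pair`, `generate`, and `reward` are hypothetical names, where `generate` stands in for sampling one response from the current DPO policy at a given temperature and `reward` stands in for the reward model's score.

```python
import random
from typing import Callable, Dict, Sequence


def build_preference_pair(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: one sample from the DPO policy
    reward: Callable[[str, str], float],    # hypothetical: reward model score
    temperatures: Sequence[float] = (0.7, 0.7, 0.7, 0.7, 1.0, 1.0, 1.0, 1.0),
) -> Dict[str, str]:
    """Sample one response per temperature, then keep the best and the worst."""
    # 8 responses in total: 4 at temperature 0.7 and 4 at temperature 1.0.
    responses = [generate(prompt, t) for t in temperatures]
    scores = [reward(prompt, r) for r in responses]

    # Best-of-8 becomes the accepted response, worst-of-8 the rejected one.
    best = max(range(len(responses)), key=scores.__getitem__)
    worst = min(range(len(responses)), key=scores.__getitem__)
    return {
        "prompt": prompt,
        "chosen": responses[best],
        "rejected": responses[worst],
    }


if __name__ == "__main__":
    rng = random.Random(0)

    # Stand-in sampler and reward, for demonstration only.
    def fake_generate(prompt: str, temperature: float) -> str:
        return f"{prompt} -> answer {rng.randint(0, 999)} (T={temperature})"

    def fake_reward(prompt: str, response: str) -> float:
        return rng.random()

    print(build_preference_pair("What is DPO?", fake_generate, fake_reward))
```

If all 8 scores are equal, the chosen and rejected responses coincide, so in practice one would probably want to skip such prompts; that filtering is an assumption here and is not part of the description above.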
The key idea is that we want the two responses to be sampled by policies close to the DPO model (to use the historical information gathered so far), while in the meantime we want them to be diverse enough and to have a relatively large margin to facilitate exploration. Rejection sampling (best-of-n and worst-of-n) is very popular in the literature and works pretty well, though I think there is still much work to do toward a more effective exploration strategy.