How to construct new pairs for adding to the dataset

Question

Thanks for your great work! I have some questions about how to add new pairs to the training set.
According to your paper:

Instead of fixing ... , we take $\pi_t^1$ and $\pi_t^2$ as the best-of-8 policy and worst-of-8 policy induced by $\pi_t^{MLE}$

As I understand it, the normal process is as follows:
Suppose we have a DPO model after training in the $t$-th iteration, denoted $\pi_t^{MLE}$.
We sample with different temperatures (0.7 for $\pi_t^1$ and 1.0 for $\pi_t^2$, to allow more exploration).
Then, we sample 8 responses from $\pi_t^2$ and rank them using the reward model.
Finally, we use the top-1 response of this set and the generation from $\pi_t^1$ as the pair $(a_{t,i}^1, a_{t,i}^2)$.

But how do I use best-of-8 and worst-of-8 to construct a pair like that?

Answer 1 · 2024-05-17T03:21:57.000Z

Given the DPO model obtained after the $t$-th training iteration, we use this policy to sample 8 responses per prompt, varying the temperature during generation (e.g., 4 responses at temperature 0.7 and 4 at 1.0). Then we score all 8 with the reward model and simply take the one with the highest reward as the accepted response and the one with the lowest reward as the rejected response.
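For concreteness, here is a minimal Python sketch of that pair construction. It is not the repository's actual code: `build_preference_pair`, `toy_generate`, and `toy_reward` are hypothetical names, and the two toy functions are stand-ins for sampling from the current DPO policy and for scoring with the reward model. The 4 + 4 split over temperatures 0.7 and 1.0 follows the example above.

```python
import random
from typing import Callable, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str, float], str],
    reward: Callable[[str, str], float],
    temperatures: Tuple[float, ...] = (0.7, 1.0),
    samples_per_temperature: int = 4,
) -> Tuple[str, str]:
    """Return (accepted, rejected) as the best and worst of the sampled responses."""
    # Sample from the same policy at each temperature (4 + 4 = 8 by default).
    candidates = [
        generate(prompt, temperature)
        for temperature in temperatures
        for _ in range(samples_per_temperature)
    ]
    # Score every candidate with the reward function.
    scored = [(reward(prompt, response), response) for response in candidates]
    accepted = max(scored, key=lambda pair: pair[0])[1]  # best-of-n
    rejected = min(scored, key=lambda pair: pair[0])[1]  # worst-of-n
    return accepted, rejected


# Hypothetical stand-ins so the sketch runs without a model (assumptions only).
_rng = random.Random(0)


def toy_generate(prompt: str, temperature: float) -> str:
    # Pretend a higher temperature yields more varied response lengths.
    n_words = 1 + int(_rng.random() * 10 * temperature)
    return " ".join(["word"] * n_words)


def toy_reward(prompt: str, response: str) -> float:
    # Pretend the reward model prefers longer responses.
    return float(len(response.split()))


accepted, rejected = build_preference_pair("Explain DPO.", toy_generate, toy_reward)
```

In a real pipeline, `generate` would call the DPO model from iteration $t$ and `reward` would call the reward model; the resulting `(accepted, rejected)` pair is what gets added to the training set for the next iteration.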

The key idea is that we want the two responses to be sampled from policies close to the DPO model (so that we use the historical information gathered so far), while at the same time we want them to be diverse enough and to have a relatively large reward margin, which facilitates exploration. Rejection sampling (best-of-n and worst-of-n) is very popular in the literature and works pretty well, though I think there is still a lot of work to do on more effective exploration strategies.
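To illustrate the margin argument, here is a small simulation sketch. It assumes, purely for illustration, that the rewards of the 8 sampled responses are i.i.d. standard Gaussian draws (a hypothetical model, not something measured here), and compares the reward gap of two arbitrary samples with the gap between the best and the worst of the 8.

```python
import random
from typing import Tuple


def margins(n: int, rng: random.Random) -> Tuple[float, float]:
    """Return (random-pair margin, best-vs-worst margin) for n simulated rewards."""
    # Assumption: rewards are i.i.d. N(0, 1); real reward models will differ.
    rewards = [rng.gauss(0.0, 1.0) for _ in range(n)]
    random_pair_margin = abs(rewards[0] - rewards[1])  # two arbitrary samples
    best_worst_margin = max(rewards) - min(rewards)    # best-of-n vs worst-of-n
    return random_pair_margin, best_worst_margin


rng = random.Random(0)
trials = [margins(8, rng) for _ in range(1000)]
mean_random = sum(t[0] for t in trials) / len(trials)
mean_best_worst = sum(t[1] for t in trials) / len(trials)
print(f"mean random-pair margin: {mean_random:.2f}")
print(f"mean best/worst margin:  {mean_best_worst:.2f}")
```

By construction, the best/worst gap is never smaller than the gap between any two responses from the same set, which is why this selection rule yields pairs with a larger margin.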