RLHFlow/Online-RLHF

One question about the loss function given a gold reward model

srzer opened this issue · 2 comments

As illustrated, a gold reward model such as sfairXC/FsfairX-LLaMA3-RM-v0.1 is trained following the BT model and performs very well. A question then naturally arises: why is the DPO loss function still based on
$\log \frac{\pi(y_1)\pi_{ref}(y_2)}{\pi_{ref}(y_1)\pi(y_2)}$,
instead of the real BT model, which would give
$\sigma(r_1-r_2)\log \frac{\pi(y_1)\pi_{ref}(y_2)}{\pi_{ref}(y_1)\pi(y_2)}+\sigma(r_2-r_1)\log \frac{\pi(y_2)\pi_{ref}(y_1)}{\pi_{ref}(y_2)\pi(y_1)}$
(hyper-parameters omitted)? Could you provide some intuition?
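For concreteness, a minimal PyTorch sketch of the two losses I have in mind (the function name, tensor arguments, and the `beta` default are just placeholders, not taken from this repo):

```python
import torch
import torch.nn.functional as F


def dpo_losses(policy_logp_1, policy_logp_2, ref_logp_1, ref_logp_2,
               gold_reward_1, gold_reward_2, beta=0.1):
    # beta * log[ pi(y1) pi_ref(y2) / (pi_ref(y1) pi(y2)) ]
    margin = beta * ((policy_logp_1 - ref_logp_1) - (policy_logp_2 - ref_logp_2))

    # Standard DPO: hard label, y1 is always treated as the preferred response.
    hard_loss = -F.logsigmoid(margin)

    # Soft-label variant: weight both orderings by the BT probabilities
    # implied by the gold rewards, sigma(r1 - r2) and sigma(r2 - r1).
    p1 = torch.sigmoid(gold_reward_1 - gold_reward_2)
    soft_loss = -(p1 * F.logsigmoid(margin) + (1.0 - p1) * F.logsigmoid(-margin))

    return hard_loss.mean(), soft_loss.mean()
```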

@hendrydong may have some insights on this.

As Hendry pointed out, this loss has already been implemented in https://github.com/RLHFlow/Online-RLHF/blob/main/dpo_iteration/dpo.py#L344. Issue resolved.