RLHFlow/Online-RLHF

One question about the loss function given a gold reward model

srzer opened this issue · 2 comments

As illustrated, a gold reward model such as sfairXC/FsfairX-LLaMA3-RM-v0.1 is trained following the BT model and performs very well. A question then naturally arises: why is the DPO loss function still based on
$\log \frac{\pi(y_1)\pi_{ref}(y_2)}{\pi_{ref}(y_1)\pi(y_2)}$,
instead of the real BT model, which would give
$\sigma(r_1-r_2)\log \frac{\pi(y_1)\pi_{ref}(y_2)}{\pi_{ref}(y_1)\pi(y_2)}+\sigma(r_2-r_1)\log \frac{\pi(y_2)\pi_{ref}(y_1)}{\pi_{ref}(y_2)\pi(y_1)}$
(hyper-parameters omitted)? Could you provide some intuition?
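For concreteness, a minimal PyTorch sketch of the two losses I have in mind (the function name, tensor arguments, and the `beta` default are just placeholders, not taken from this repo):

```python
import torch
import torch.nn.functional as F


def dpo_losses(policy_logp_1, policy_logp_2, ref_logp_1, ref_logp_2,
               gold_reward_1, gold_reward_2, beta=0.1):
    # beta * log[ pi(y1) pi_ref(y2) / (pi_ref(y1) pi(y2)) ]
    margin = beta * ((policy_logp_1 - ref_logp_1) - (policy_logp_2 - ref_logp_2))

    # Standard DPO: hard label, y1 is always treated as the preferred response.
    hard_loss = -F.logsigmoid(margin)

    # Soft-label variant: weight both orderings by the BT probabilities
    # implied by the gold rewards, sigma(r1 - r2) and sigma(r2 - r1).
    p1 = torch.sigmoid(gold_reward_1 - gold_reward_2)
    soft_loss = -(p1 * F.logsigmoid(margin) + (1.0 - p1) * F.logsigmoid(-margin))

    return hard_loss.mean(), soft_loss.mean()
```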

@hendrydong may have some insights on this.

As Hendry pointed out, this loss has already been implemented in https://github.com/RLHFlow/Online-RLHF/blob/main/dpo_iteration/dpo.py#L344. Issue resolved.