One question about the loss function given a gold reward model
srzer opened this issue · 2 comments
srzer commented
As illustrated, a gold reward model like sfairXC/FsfairX-LLaMA3-RM-v0.1 is trained following the BT model and performs very well, so a question naturally arises: why is the DPO loss function still
WeiXiongUST commented
@hendrydong may have some insights on this.
srzer commented
As Hendry pointed out, this loss has already been implemented in https://github.com/RLHFlow/Online-RLHF/blob/main/dpo_iteration/dpo.py#L344. Issue resolved.
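For context, the loss being discussed is a DPO-style objective; below is a minimal sketch of the standard sigmoid (Bradley-Terry) DPO loss for reference. The function and variable names here are illustrative only and are not the actual code at the linked `dpo.py#L344`, which may implement a different variant.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard sigmoid DPO loss (sketch, not the repo's implementation).

    Each argument is a tensor of per-sequence log-probabilities,
    i.e. token log-probs already summed over the response.
    """
    # Log-ratio of chosen vs. rejected under the policy and the reference model
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios
    # -log sigmoid(beta * logits), i.e. the BT negative log-likelihood
    loss = -F.logsigmoid(beta * logits)
    return loss.mean()
```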