Question about IPO loss vs DPO loss
MoonBlvd opened this issue · 1 comment
MoonBlvd commented
Thanks for the great work!
I'm looking at the IPO and DPO losses here:
```python
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps

if reference_free:
    ref_logratios = 0

logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

if ipo:
    losses = (logits - 1/(2 * beta)) ** 2  # Eq. 17 of https://arxiv.org/pdf/2310.12036v2.pdf
else:
    # Eq. 3 of https://ericmitchell.ai/cdpo.pdf; label_smoothing=0 gives original DPO (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf)
    losses = -F.logsigmoid(beta * logits) * (1 - label_smoothing) - F.logsigmoid(-beta * logits) * label_smoothing

chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
return losses, chosen_rewards, rejected_rewards
```
Is it correct to minimize `losses = (logits - 1/(2 * beta)) ** 2`? Wouldn't this minimize `policy_chosen_logps` and maximize `policy_rejected_logps`?
Your implementation seems to match Algorithm 1 in the original IPO paper, so I'm raising this in case the original paper also made a mistake.
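To spell out the reasoning behind my concern (my own derivation, writing $h$ for `logits`):

$$\frac{\partial}{\partial h}\left(h - \frac{1}{2\beta}\right)^2 = 2\left(h - \frac{1}{2\beta}\right)$$

so whenever $h > \frac{1}{2\beta}$ the gradient is positive and gradient descent pushes $h$ down, i.e. it decreases `policy_chosen_logps` relative to `policy_rejected_logps`. That is what made me wonder whether the squared loss points in the right direction.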
yata0 commented
The IPO loss minimizes the distance between `logits` and `1/(2 * beta)`, rather than minimizing `logits` itself. You can verify this by comparing the gradients of the IPO and DPO losses.
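A minimal sketch to illustrate this (my own check, not code from the repo; the `beta` value is arbitrary). It evaluates the gradient of the IPO loss on either side of the `1/(2 * beta)` target and shows that the sign flips: `logits` below the target are pushed up, and `logits` above it are pushed down, so the loss pulls `logits` toward `1/(2 * beta)` rather than toward zero.

```python
import torch

beta = 0.1
target = 1 / (2 * beta)  # 5.0

# One logit value below the target, one above it.
for val in (target - 1.0, target + 1.0):
    logits = torch.tensor(val, requires_grad=True)
    ipo_loss = (logits - target) ** 2
    ipo_loss.backward()
    # dL/dlogits = 2 * (logits - target): negative below the target
    # (so gradient descent increases logits), positive above it.
    print(f"logits={val:.1f}  dL/dlogits={logits.grad.item():+.1f}")
```

Running this prints a gradient of `-2.0` at `logits = 4.0` and `+2.0` at `logits = 6.0`, so minimizing the loss drives the chosen-vs-rejected margin toward `1/(2 * beta)`, not toward zero.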