eric-mitchell/direct-preference-optimization

Question bout IPO loss vs DPO loss

MoonBlvd opened this issue · 1 comments

Thanks for the great work!

I'm looking at the IPO loss and DPO losses here:

    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps

    if reference_free:
        ref_logratios = 0

    logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

    if ipo:
        losses = (logits - 1/(2 * beta)) ** 2  # Eq. 17 of https://arxiv.org/pdf/2310.12036v2.pdf
    else:
        # Eq. 3 https://ericmitchell.ai/cdpo.pdf; label_smoothing=0 gives original DPO (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf)
        losses = -F.logsigmoid(beta * logits) * (1 - label_smoothing) - F.logsigmoid(-beta * logits) * label_smoothing

    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

    return losses, chosen_rewards, rejected_rewards

is it correct to minimize losses = (logits - 1/(2 * beta)) ** 2?
wouldn't this minimize policy_chosen_logps and maximize policy_rejected_logps?
Seems your implementation is the same to the Algorithm 1 in the original IPO paper, just in case the original paper also made a mistake.

The IPO loss means to minimize the distance between logits and 1/(2*beta), rather than minimize the logits. You can check the gradients of IPO loss and DPO loss.