Question about IPO loss vs DPO loss #64

MoonBlvd · 2024-01-30T05:53:25Z

Thanks for the great work!

I'm looking at the IPO and DPO losses here:

    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps

    if reference_free:
        ref_logratios = 0

    logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

    if ipo:
        losses = (logits - 1/(2 * beta)) ** 2  # Eq. 17 of https://arxiv.org/pdf/2310.12036v2.pdf
    else:
        # Eq. 3 https://ericmitchell.ai/cdpo.pdf; label_smoothing=0 gives original DPO (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf)
        losses = -F.logsigmoid(beta * logits) * (1 - label_smoothing) - F.logsigmoid(-beta * logits) * label_smoothing

    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

    return losses, chosen_rewards, rejected_rewards

Is it correct to minimize losses = (logits - 1/(2 * beta)) ** 2?
Wouldn't this minimize policy_chosen_logps and maximize policy_rejected_logps?
Your implementation seems to match Algorithm 1 in the original IPO paper, so I'm asking in case the paper also made a mistake.

yata0 · 2024-04-02T09:37:36Z

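The IPO loss minimizes the squared distance between logits and 1/(2*beta); it does not minimize the logits themselves. While logits is below 1/(2*beta), gradient descent increases it, which raises policy_chosen_logps relative to policy_rejected_logps. You can verify this by comparing the gradients of the IPO loss and the DPO loss with respect to logits.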
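A minimal sketch of that gradient check, in plain Python rather than the repository's PyTorch code. The gradients with respect to logits (called h here) are derived by hand, and the helper names ipo_grad and dpo_grad as well as the value beta = 0.1 are assumptions chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ipo_loss(h, beta):
    # Squared distance between the margin h and the target 1/(2*beta).
    return (h - 1.0 / (2.0 * beta)) ** 2


def ipo_grad(h, beta):
    # d/dh of (h - 1/(2*beta))**2
    return 2.0 * (h - 1.0 / (2.0 * beta))


def dpo_loss(h, beta):
    # DPO loss with label_smoothing = 0: -log(sigmoid(beta * h))
    return -math.log(sigmoid(beta * h))


def dpo_grad(h, beta):
    # d/dh of -log(sigmoid(beta * h)) = -beta * sigmoid(-beta * h)
    return -beta * sigmoid(-beta * h)


beta = 0.1                   # illustrative value (assumption)
target = 1.0 / (2.0 * beta)  # margin at which the IPO loss is zero

# h = 0 corresponds to policy == reference, i.e. the start of training.
for h in (0.0, target, 2.0 * target):
    print(f"h={h:5.1f}  ipo_grad={ipo_grad(h, beta):+.3f}  "
          f"dpo_grad={dpo_grad(h, beta):+.4f}")
```

A gradient descent step moves h against the sign of the gradient. At h = 0 the IPO gradient is negative, so the step increases h, meaning the chosen log-probability rises relative to the rejected one. The IPO gradient reaches zero at h = 1/(2*beta) and turns positive beyond it, pulling h back toward the target. The DPO gradient is negative for every h, so DPO keeps pushing the margin up without a fixed target.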
