Question about IPO loss vs DPO loss #64

Open
MoonBlvd opened this issue Jan 30, 2024 · 1 comment

Comments

@MoonBlvd

Thanks for the great work!

I'm looking at the IPO and DPO losses here:

    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps

    if reference_free:
        ref_logratios = 0

    logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

    if ipo:
        losses = (logits - 1/(2 * beta)) ** 2  # Eq. 17 of https://arxiv.org/pdf/2310.12036v2.pdf
    else:
        # Eq. 3 https://ericmitchell.ai/cdpo.pdf; label_smoothing=0 gives original DPO (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf)
        losses = -F.logsigmoid(beta * logits) * (1 - label_smoothing) - F.logsigmoid(-beta * logits) * label_smoothing

    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

    return losses, chosen_rewards, rejected_rewards

Is it correct to minimize losses = (logits - 1/(2 * beta)) ** 2?
Wouldn't this minimize policy_chosen_logps and maximize policy_rejected_logps?
Your implementation seems to be the same as Algorithm 1 in the original IPO paper, so I'm asking in case the original paper also made a mistake.

@yata0

yata0 commented Apr 2, 2024

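The IPO loss minimizes the distance between logits and 1/(2*beta), rather than minimizing logits itself. When logits is below 1/(2*beta), the gradient pushes logits up (raising policy_chosen_logps relative to policy_rejected_logps); it only pushes logits down once logits exceeds 1/(2*beta). You can check this by comparing the gradients of the IPO loss and the DPO loss.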
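A quick way to check the gradient signs is a minimal pure-Python sketch like the one below. It is not the repository's code: `ipo_loss`, `dpo_loss`, and `numerical_grad` are hypothetical helper names, the scalar `h` stands in for `logits`, `beta = 0.1` is an arbitrary choice, and the DPO case assumes `label_smoothing = 0`.

```python
import math

def ipo_loss(h, beta):
    # Same form as the snippet in the question: (logits - 1/(2*beta)) ** 2
    return (h - 1.0 / (2.0 * beta)) ** 2

def dpo_loss(h, beta):
    # -logsigmoid(beta * h) with label_smoothing = 0 (assumption),
    # written as a numerically stable softplus(-beta * h).
    z = -beta * h
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def numerical_grad(f, h, beta, eps=1e-6):
    # Central finite difference of the loss with respect to h (the logits).
    return (f(h + eps, beta) - f(h - eps, beta)) / (2.0 * eps)

beta = 0.1                    # arbitrary value for illustration
target = 1.0 / (2.0 * beta)   # IPO regression target for the logits

for h in (0.0, target, 2.0 * target):
    g_ipo = numerical_grad(ipo_loss, h, beta)
    g_dpo = numerical_grad(dpo_loss, h, beta)
    print(f"h={h:5.1f}  dIPO/dh={g_ipo:+.4f}  dDPO/dh={g_dpo:+.4f}")
```

Under gradient descent, a negative derivative means the update increases `h`. The IPO derivative should be negative below the target, roughly zero at it, and positive above it, while the DPO derivative should stay negative everywhere, so DPO keeps pushing the margin up whereas IPO regresses it toward 1/(2*beta).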
