Thanks for the great work!

I'm looking at the IPO and DPO losses. Is it correct to minimize `losses = (logits - 1/(2 * beta)) ** 2`? Wouldn't this minimize `policy_chosen_logps` and maximize `policy_rejected_logps`?

Your implementation seems to be the same as Algorithm 1 in the original IPO paper, so I'm asking in case the original paper also made a mistake.
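The IPO loss minimizes the distance between `logits` and `1/(2*beta)`; it does not minimize `logits` itself. While `logits` is below `1/(2*beta)`, the gradient of the squared term is negative, so gradient descent pushes `logits` up rather than down. You can verify this by comparing the gradients of the IPO loss and the DPO loss.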
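To make the gradient comparison concrete, here is a minimal sketch in plain Python. It treats `logits` as a single scalar and assumes, as in common DPO implementations, that `logits` is the chosen-minus-rejected log-probability gap of the policy minus the same gap for the reference model; the thread itself does not spell this out. The function names `ipo_loss`, `ipo_grad`, `dpo_loss` and `dpo_grad` are hypothetical helpers for illustration, not code from the repository, and the DPO loss is taken to be the standard sigmoid form `-log(sigmoid(beta * logits))`.

```python
import math


def ipo_loss(logits, beta):
    """Squared distance between logits and the target 1/(2*beta)."""
    return (logits - 1 / (2 * beta)) ** 2


def ipo_grad(logits, beta):
    """Analytic derivative of ipo_loss with respect to logits."""
    return 2 * (logits - 1 / (2 * beta))


def dpo_loss(logits, beta):
    """Sigmoid DPO loss, -log(sigmoid(beta * logits)), written with log1p."""
    return math.log1p(math.exp(-beta * logits))


def dpo_grad(logits, beta):
    """Analytic derivative of dpo_loss with respect to logits."""
    return -beta / (1 + math.exp(beta * logits))


beta = 0.1  # illustrative value, not taken from the thread
target = 1 / (2 * beta)

# Evaluate both gradients below, at, and above the IPO target.
for logits in (0.0, target, 2 * target):
    print(f"logits={logits:5.1f}  "
          f"ipo_grad={ipo_grad(logits, beta):+.4f}  "
          f"dpo_grad={dpo_grad(logits, beta):+.4f}")
```

Under these assumptions the IPO gradient is negative below the target, zero at the target, and positive above it, so a descent step moves `logits` toward `1/(2*beta)` from either side. The DPO gradient is negative everywhere, so a descent step always increases `logits`, with a magnitude that shrinks as `logits` grows.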