First of all, thanks for your great work!
I'm confused, since you said you use the SFT model or the Preferred-FT model as the reference policy for DPO training.
But for Preferred-FT in Figure 2, what is its reference policy? In other words, how is the KL divergence computed, and is the reference policy aligned?