In response to #54, #45, #35, #28, #27 and many related questions, I am open-sourcing my code to reproduce the IMDB experiments here.
Remark: Why don't the rewards match the paper exactly? I suspect it is because I took a shortcut with the reference model. IMHO, the most correct procedure would be to (1) generate the preference dataset, (2) run SFT on the positive (chosen) samples, and (3) run DPO with that SFT model as the reference. SFT pushes the model toward the data distribution, so the KL constraint then makes sense: we don't want the generations to drift far from that distribution. Any other choice of reference model may introduce an out-of-distribution (OOD) problem.
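To make the role of the reference model concrete, here is a minimal PyTorch sketch of the DPO objective (not code from this repo; the function name and the sequence-level log-prob inputs are my own choices for illustration). The implicit reward is beta * log(pi_theta / pi_ref), so pi_ref is what anchors the policy; if pi_ref is far from the preference data (a "lazy" ref model), the KL constraint pulls generations toward the wrong distribution.

```python
# Minimal sketch of the DPO loss, assuming you have already computed
# sequence-level log-probabilities of the chosen/rejected completions
# under the policy and under the reference (SFT) model.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x),   shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x),   shape (batch,)
    beta: float = 0.1,
) -> torch.Tensor:
    """-log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

With pi_ref set to the SFT model trained on the chosen samples, the log-ratios measure deviation from a model that already sits near the preference data; with a generic pretrained model as pi_ref, the same penalty keeps the policy close to a distribution the preference pairs were never drawn from, which is one plausible reason the rewards differ from the paper.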