My Code to Reproduce IMDB #69

Open
QiyaoWei opened this issue Feb 26, 2024 · 0 comments

Comments

@QiyaoWei

In response to #54, #45, #35, #28, #27, and many related questions, I am open-sourcing my code to reproduce the IMDB experiments here.

[Figure: kl_vs_rewards — KL vs. reward plot from the reproduced runs]

Remark: why don't the rewards match the paper exactly? I suspect it comes down to my choice of a lazy reference model. In my view, the most correct procedure would be to (1) generate the preference dataset, (2) run SFT on the positive samples, and (3) run DPO. SFT pushes the model closer to the data distribution, so the KL constraint then makes sense: we don't want the generations to drift far from the data distribution. Any other ordering risks an out-of-distribution (OOD) problem. A rough sketch of this pipeline is below.
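For anyone looking for a starting point, here is a minimal sketch of that three-step pipeline using Hugging Face's trl library (SFTTrainer / DPOTrainer, argument names roughly as of trl 0.7 and liable to change across versions). The model names, the lvwerra/distilbert-imdb sentiment scorer, the data sizes, and all hyperparameters are illustrative assumptions on my part, not necessarily the setup behind the plot above.

```python
import torch
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          TrainingArguments, pipeline)
from trl import DPOTrainer, SFTTrainer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
policy = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
# Illustrative sentiment scorer; any IMDB sentiment classifier would do.
scorer = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb",
                  device=0 if device == "cuda" else -1)

def positivity(text):
    """Signed sentiment score: positive reviews score > 0, negative < 0."""
    result = scorer(text, truncation=True)[0]
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

# (1) Generate the preference dataset: for each IMDB prefix, sample two
#     completions and label the more positive one (per the scorer) "chosen".
prompts = [t[:64] for t in load_dataset("imdb", split="train[:200]")["text"]]
pairs = {"prompt": [], "chosen": [], "rejected": []}
for prompt in prompts:
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outs = policy.generate(ids, do_sample=True, max_new_tokens=48,
                           num_return_sequences=2,
                           pad_token_id=tokenizer.eos_token_id)
    a, b = (tokenizer.decode(o[ids.shape[1]:], skip_special_tokens=True)
            for o in outs)
    chosen, rejected = (a, b) if positivity(prompt + a) >= positivity(prompt + b) else (b, a)
    pairs["prompt"].append(prompt)
    pairs["chosen"].append(chosen)
    pairs["rejected"].append(rejected)
pref_ds = Dataset.from_dict(pairs)

# (2) SFT on the positive ("chosen") samples, so the model sits close to
#     the preferred side of the data distribution before DPO.
sft_ds = pref_ds.map(lambda x: {"text": x["prompt"] + x["chosen"]})
sft_args = TrainingArguments(output_dir="sft", per_device_train_batch_size=4,
                             num_train_epochs=1, report_to="none")
SFTTrainer(model=policy, args=sft_args, train_dataset=sft_ds,
           dataset_text_field="text", tokenizer=tokenizer,
           max_seq_length=128).train()

# (3) DPO, with a frozen copy of the SFT weights as the reference model, so
#     the KL constraint is measured against the SFT distribution.
ref_model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)
ref_model.load_state_dict(policy.state_dict())
dpo_args = TrainingArguments(output_dir="dpo", per_device_train_batch_size=4,
                             num_train_epochs=1, report_to="none")
DPOTrainer(model=policy, ref_model=ref_model, args=dpo_args, beta=0.1,
           train_dataset=pref_ds, tokenizer=tokenizer,
           max_length=128, max_prompt_length=64).train()
```

The point of this ordering is step (3): the reference model is a frozen copy of the SFT weights, so the DPO KL penalty anchors generation to the distribution the model was just fine-tuned on rather than to a stale base model.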
