In response to #54, #45, #35, #28, #27 and many related questions, I am open-sourcing my code to reproduce the IMDB experiments here.
Remark: Why don't the rewards match the paper exactly? I suspect it is because I took a shortcut with the reference model. IMHO, the most correct procedure would be to (1) generate the preference dataset, (2) run SFT on the positive (chosen) samples, and (3) run DPO with that SFT model as the reference. SFT pushes the model toward the data distribution, so the KL constraint then makes sense: we don't want the generations to drift far from that distribution. Any other choice of reference model may introduce an out-of-distribution (OOD) problem.
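To make the role of the reference model concrete, here is a minimal PyTorch sketch of the DPO objective (not code from this repo; the function name and the sequence-level log-prob inputs are my own choices for illustration). The implicit reward is beta * log(pi_theta / pi_ref), so pi_ref is what anchors the policy; if pi_ref is far from the preference data (a "lazy" ref model), the KL constraint pulls generations toward the wrong distribution.

```python
# Minimal sketch of the DPO loss, assuming you have already computed
# sequence-level log-probabilities of the chosen/rejected completions
# under the policy and under the reference (SFT) model.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x),   shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x),   shape (batch,)
    beta: float = 0.1,
) -> torch.Tensor:
    """-log sigmoid(beta * [(log-ratio of chosen) - (log-ratio of rejected)])."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

With pi_ref set to the SFT model trained on the chosen samples, the log-ratios measure deviation from a model that already sits near the preference data; with a generic pretrained model as pi_ref, the same penalty keeps the policy close to a distribution the preference pairs were never drawn from, which is one plausible reason the rewards differ from the paper.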