SPIN == DPO in self-iteration? #26
Comments
Why not combine DPO and SPIN? Put the previous generation into the rejected column and the new generation into the accepted one, then train with DPO (or ORPO) at each iteration.
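A minimal sketch of the pairing proposed in this comment, for concreteness. The function name `build_preference_pairs` and the `prompt`/`chosen`/`rejected` keys are assumptions made for illustration and are not taken from the SPIN codebase. Note that this pairs two model generations with each other, whereas SPIN as described elsewhere in this thread uses the SFT data as the preferred response.

```python
from typing import Dict, List


def build_preference_pairs(
    prompts: List[str],
    previous_generations: List[str],
    new_generations: List[str],
) -> List[Dict[str, str]]:
    """Pair each prompt's newer generation (chosen) with its older one (rejected).

    Hypothetical helper: the column names are an assumption, chosen to resemble
    a typical preference-dataset layout.
    """
    if not (len(prompts) == len(previous_generations) == len(new_generations)):
        raise ValueError("all three lists must have the same length")
    return [
        {"prompt": prompt, "chosen": new, "rejected": old}
        for prompt, old, new in zip(prompts, previous_generations, new_generations)
    ]


pairs = build_preference_pairs(
    ["What is 2 + 2?"],
    ["It is 5."],   # generation from the previous iteration
    ["It is 4."],   # generation from the current iteration
)
```

The resulting list could then be handed to whatever DPO or ORPO training loop is in use at each iteration.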
I believe that is what I posted here: SPIN == DPO in self-iteration
DPO relies on the Bradley-Terry (BT) model or the more general Plackett-Luce models, matching the outcomes of pairwise comparisons directly with an implicit reward model. The core DPO methodology therefore does not inherently lead to iterative training. SPIN, on the other hand, relies on self-play to compete against an increasingly stronger version of itself, so its self-play mechanism naturally leads to an iterative training dynamic. Although the two converge to similar outcomes, this foundational difference leads to distinct practical scenarios. The following are some key resulting differences:
We also want to clarify that SPIN requires only the SFT dataset, without any external supervision such as preference labels. The most relevant baseline for a fair comparison is therefore the standard SFT method. The Figure 3 you posted is meant to emphasize the importance of fully utilizing SFT data. The training data for the DPO baseline and for SPIN in this figure are different. In particular, the DPO baseline zephyr-7b-beta is a model trained with DPO on approximately 62k new preference data points from the UltraFeedback Binarized dataset (Cui et al., 2023), which is different from the SFT dataset. Our method, meanwhile, leverages only the SFT dataset. Although both methods start from an SFT model, one of the most significant distinctions between SPIN and DPO is that SPIN eliminates the need for preference labeling and for any data beyond the SFT dataset. If only the SFT dataset were available, it would not be possible to apply DPO, while SPIN works effectively.
So, if I understand correctly: if I use a human-annotated DPO dataset as the "real"/"generated" pairs and keep the loss function as logsigmoid, is iteration 0 of SPIN equivalent to DPO?
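A minimal numeric sketch of the per-example comparison behind this question. It assumes the standard DPO loss and the SPIN objective with the logistic loss ℓ(t) = log(1 + exp(−t)); the function names, argument names, and log-probability values are made up for illustration. It only shows that the two expressions coincide when the "real" response plays the role of the chosen one, the "generated" response plays the role of the rejected one, and the reference model is the previous iterate. It says nothing about whether the training data or the results are the same.

```python
import math


def log_sigmoid(t: float) -> float:
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def logistic_loss(t: float) -> float:
    """Numerically stable log(1 + exp(-t))."""
    return max(-t, 0.0) + math.log1p(math.exp(-abs(t)))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


def spin_loss(logp_real, logp_generated, prev_logp_real, prev_logp_generated, lam=0.1):
    """SPIN loss with logistic l(t); the previous iterate acts as the reference."""
    t = lam * (logp_real - prev_logp_real) - lam * (logp_generated - prev_logp_generated)
    return logistic_loss(t)


# Illustrative (made-up) sequence log-probabilities.
dpo_value = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
spin_value = spin_loss(-10.0, -12.0, -11.0, -11.5, lam=0.1)
print(abs(dpo_value - spin_value) < 1e-12)
```

Since −log σ(t) = log(1 + exp(−t)), the two functions differ only in how their inputs are named and where the paired responses come from.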
I wonder whether SPIN could replace SFT, or whether it must start from an SFT (not base) model. Is it meant to be a stage between SFT and DPO?
The following part of the paper explains the difference between SPIN and DPO.
It claims that DPO improves the model using instance-level information, while SPIN operates at the distribution level.
However, comparing the two formulas, the difference is minor when the SFT data in SPIN, y ~ P_data, is regarded as the winner y_w in DPO and the LLM output in SPIN, y' ~ P_theta, is regarded as the loser y_l in DPO.
How can you explain this?
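For reference, the two objectives being compared can be written side by side. This is a sketch using the notation of the respective papers as best recalled here (β and λ are the regularization parameters, π_ref is the DPO reference policy, p_{θ_t} is the SPIN opponent from iteration t, and ℓ is a convex decreasing loss such as the logistic loss); check the papers for the exact statements.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[ \log\sigma\!\left(
      \beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

\mathcal{L}_{\mathrm{SPIN}}(\theta, \theta_t)
  = \mathbb{E}_{x\sim q,\; y\sim p_{\mathrm{data}}(\cdot \mid x),\; y'\sim p_{\theta_t}(\cdot \mid x)}
    \left[ \ell\!\left(
      \lambda \log\frac{p_\theta(y \mid x)}{p_{\theta_t}(y \mid x)}
      - \lambda \log\frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)}
    \right) \right],
  \qquad \ell(t) = \log\!\left(1 + e^{-t}\right)
```

With the logistic ℓ, substituting y_w = y, y_l = y', π_ref = p_{θ_t} and β = λ makes the two integrands identical; what remains different is where the pairs come from (a fixed labeled preference set versus SFT targets paired with fresh samples from the current opponent).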