Recently, I have been experimenting with DPO training for Vietnamese. I start from a strong SFT model, vinai/PhoGPT-4B-Chat, and follow the method described in Chen, Zixiang, et al. "Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models." arXiv preprint arXiv:2401.01335 (2024) to build a preference dataset from my own SFT dataset. I use trl for training with the following config:

bf16
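For reference, a minimal sketch of what such a run looks like with trl's `DPOTrainer` is shown below. Only the model name and the bf16 setting come from the description above; the dataset rows, hyperparameters (beta, learning rate, batch sizes), and output directory are illustrative placeholders, and argument names can differ slightly between trl versions (e.g. `tokenizer=` vs. `processing_class=`).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "vinai/PhoGPT-4B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# SPIN-style preference pairs built from an SFT dataset:
# "chosen" is the human-written SFT response, "rejected" is the
# current model's own generation for the same prompt.
# The single row below is a placeholder, not real data.
train_dataset = Dataset.from_dict({
    "prompt": ["### Câu hỏi: <instruction>\n### Trả lời:"],
    "chosen": ["<human-written SFT response>"],
    "rejected": ["<model-generated response>"],
})

# Hypothetical hyperparameters; only bf16 is taken from the report above.
training_args = DPOConfig(
    output_dir="phogpt-4b-chat-spin-iter0",
    bf16=True,
    beta=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # use processing_class=tokenizer on newer trl releases
)
trainer.train()
```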
While training, the loss decreases very fast, but after the first epoch the logits of both the chosen and rejected responses drop to 0, and the model suffers from degeneration (it generates the character ` repeatedly) after 1 epoch.
Here are the full logs of the training process and a sample output of the model; you can read more in the column "PhoGPT-4B-Chat-SPIN-0-4K-one-turn-ep1" of the attached Google Sheet:
Do you have any suggestions for this problem?