
Not convergent in custom dataset. #6

Open · jcao-ai opened this issue Jun 7, 2023 · 9 comments

Comments

jcao-ai commented Jun 7, 2023

Hi, glad to see the impressive project.

I overrode trainer.py according to the README and training runs properly. However, the model doesn't seem to converge: the loss stays around 0.7 no matter how many epochs it trains for. BTW, I use LLaMA, not OPT.

As a comparison, if I train on the same dataset with full fine-tuning, everything works fine and the loss drops below 0.1 almost immediately.

So is there some constraint that could cause training to fail?

[Screenshot attached: 2023-06-07 13:27]
gaotianyu1350 (Member) commented

Hi,

Thanks for your interest in our project! Can you specify the setting (task, model size, hyperparameters like steps, learning rate, eps, etc.)? Also, did you use our codebase and replace OPT with LLaMA, or did you copy the MeZO part into another codebase?

jcao-ai (Author) commented Jun 8, 2023

> Hi,
>
> Thanks for your interest in our project! Can you specify the setting (task, model size, hyperparameters like steps, learning rate, eps, etc.)? Also, did you use our codebase and replace OPT with LLaMA, or did you copy the MeZO part into another codebase?

Thanks for the reply. I integrated your trainer into my own codebase (LLaMA-7B). lr: 2e-5, steps: 3000, eps: 1e-3.

gaotianyu1350 (Member) commented

Hi,

I believe the learning rate is too large. I suggest starting from LR=1e-6, EPS=1e-3, and tuning the hyperparameters with grid search. Also, the number of steps needed for convergence depends on the task; I'd suggest trying at least 5,000 steps. All the OPT experiments in our paper used 20,000 steps.
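
For reference, here is a minimal sketch of a single MeZO step (my illustration, not the repository's exact trainer code), showing where LR and EPS enter the update; `compute_loss` is assumed to be a closure that returns a scalar loss under `torch.no_grad()`:

```python
import torch

def mezo_step(model, batch, compute_loss, lr=1e-6, eps=1e-3):
    # Draw a seed so the same perturbation z can be re-generated later
    # instead of being stored in memory.
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)
        for p in model.parameters():
            if p.requires_grad:
                z = torch.randn_like(p)
                p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1)                      # theta + eps * z
        loss_pos = compute_loss(model, batch)
        perturb(-2)                      # theta - eps * z
        loss_neg = compute_loss(model, batch)
        perturb(+1)                      # restore theta

        # Zeroth-order estimate of the directional derivative along z.
        projected_grad = (loss_pos - loss_neg) / (2 * eps)

        # SGD-style update: theta <- theta - lr * projected_grad * z,
        # re-generating the same z from the stored seed.
        torch.manual_seed(seed)
        for p in model.parameters():
            if p.requires_grad:
                z = torch.randn_like(p)
                p.data.add_(-lr * projected_grad * z)

    return loss_pos
```

Note how the update scale is lr * (loss difference) / (2 * eps): a too-large LR amplifies the already noisy zeroth-order estimate, which is why small learning rates and longer training are typically needed.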


nousr commented Jun 8, 2023

> I'd suggest trying at least 5,000 steps. All the OPT experiments in our paper used 20,000 steps.

@gaotianyu1350 how many steps were used in the paper when not using MeZO?

I'm running a similar experiment with LLaMA 7B and having trouble getting the model to converge (I can share results later today). I'm really curious, though, how many more steps were needed to fine-tune OPT.

thanks!

lramming commented

FYI, I have done some experiments on a custom dataset with OPT-125m and LoRA, and I observed the same problem of the loss not going down. I had to resort to really high learning rates, and for the first time I see the loss going down. It's probably too high now, since the loss doesn't go down smoothly but rather stepwise. I am using:

  • lr: 4e-2, cosine lr schedule
  • zo_eps: 5e-3,
  • batch size of 32
  • ~300k trainable parameters

The loss reported by DeepSpeed:
{'loss': 16.2535, 'learning_rate': 0.03999983126569921, 'epoch': 0.0}
{'loss': 15.5859, 'learning_rate': 0.039998481408375613, 'epoch': 0.0}
{'loss': 15.8855, 'learning_rate': 0.03999578178483493, 'epoch': 0.0}
{'loss': 15.832, 'learning_rate': 0.039991732577284014, 'epoch': 0.01}
{'loss': 16.0207, 'learning_rate': 0.0399863340590178, 'epoch': 0.01}
{'loss': 16.0691, 'learning_rate': 0.03997958659440085, 'epoch': 0.01}
{'loss': 15.9879, 'learning_rate': 0.03997149063884271, 'epoch': 0.01}
{'loss': 15.766, 'learning_rate': 0.03996204673876726, 'epoch': 0.01}
{'loss': 14.9773, 'learning_rate': 0.03995125553157573, 'epoch': 0.01}
{'loss': 7.8742, 'learning_rate': 0.03993911774560379, 'epoch': 0.01}
{'loss': 8.3316, 'learning_rate': 0.039925634200072314, 'epoch': 0.01}
{'loss': 7.8898, 'learning_rate': 0.039910805805032104, 'epoch': 0.02}
{'loss': 8.0785, 'learning_rate': 0.039894633561302496, 'epoch': 0.02}
{'loss': 8.0207, 'learning_rate': 0.039877118560403775, 'epoch': 0.02}
{'loss': 7.818, 'learning_rate': 0.03985826198448353, 'epoch': 0.02}
{'loss': 8.1402, 'learning_rate': 0.039838065106236845, 'epoch': 0.02}
{'loss': 7.884, 'learning_rate': 0.039816529288820436, 'epoch': 0.02}
{'loss': 7.8727, 'learning_rate': 0.039793655985760595, 'epoch': 0.02}
{'loss': 7.9, 'learning_rate': 0.039769446740855134, 'epoch': 0.02}
{'loss': 7.6941, 'learning_rate': 0.03974390318806917, 'epoch': 0.03}
{'loss': 7.8531, 'learning_rate': 0.039717027051424825, 'epoch': 0.03}
{'loss': 7.8488, 'learning_rate': 0.03968882014488491, 'epoch': 0.03}
{'loss': 7.8055, 'learning_rate': 0.03965928437223045, 'epoch': 0.03}
{'loss': 7.8672, 'learning_rate': 0.03962842172693222, 'epoch': 0.03}
{'loss': 7.9863, 'learning_rate': 0.03959623429201618, 'epoch': 0.03}
{'loss': 7.8461, 'learning_rate': 0.03956272423992289, 'epoch': 0.03}
{'loss': 7.9762, 'learning_rate': 0.03952789383236089, 'epoch': 0.04}

So it will take some time until it converges, but the loss is going down, which is nice.

gaotianyu1350 (Member) commented

@nousr You can refer to our Appendix D for steps used in each experiment.

@lramming Can you specify which dataset this is? Note two key points to make MeZO work: (1) always use prompts, and (2) allow a longer training time. All our OPT experiments use 20K steps, though you should expect to see significant performance improvement within 5K steps.
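
To illustrate point (1), here is a toy example of wrapping a classification example in a prompt so the task becomes next-token prediction (my illustration, not necessarily the paper's exact templates; those are listed in its appendix):

```python
# Toy prompt template for an SST-2-style sentiment task: the label is mapped
# to a verbalizer word and the model is trained to predict that word.
VERBALIZER = {0: "terrible", 1: "great"}

def build_prompt(sentence, label=None):
    prompt = f"{sentence} It was"
    if label is not None:  # training example includes the target word
        prompt += f" {VERBALIZER[label]}."
    return prompt

print(build_prompt("A moving and unforgettable film.", label=1))
# -> "A moving and unforgettable film. It was great."
```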

lramming commented

Unfortunately, I can't comment on the dataset. In the end, it unfortunately did not converge but remained at a loss of ~7, which is a lot higher than with normal optimisers. However, I am fairly certain that this is more a problem of choosing the right hyperparameters than an issue with the actual algorithm.
I did find that convergence is highly dependent on the choice of zo_eps and the learning rate; in some experiments, going a bit higher with zo_eps actually improved how the loss went down. I also tried normalising the gradient before updating the parameters; it improved convergence in some cases but ultimately failed: either the loss hovered around some high value without going down, or it hit a problem and went to 0.

I also suspect that the current implementation does not work well with DeepSpeed; you can switch off the DeepSpeed optimiser by removing the "optimizer" part of the DeepSpeed config and specifying "zero_force_ds_cpu_optimizer": false.
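
For what it's worth, a minimal sketch of that config change (only the "zero_force_ds_cpu_optimizer" flag and the missing "optimizer" section come from the comment above; the other fields are placeholders for whatever the rest of your config contains):

```python
import json

# DeepSpeed config with no "optimizer" section, so DeepSpeed does not build
# its own optimizer and MeZO's in-place parameter updates are left alone.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,      # placeholder value
    "gradient_accumulation_steps": 1,          # placeholder value
    "zero_force_ds_cpu_optimizer": False,
    # intentionally no "optimizer" block here
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```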

Doing a proper sweep across the hyperparameters that matter for the given dataset is probably a good idea; maybe I'll have some time to implement this in the future. A rough outline is sketched below.
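
Something along these lines would do, assuming a hypothetical run_training(lr, eps) callable that trains for a fixed number of steps and returns the final loss:

```python
import itertools

def grid_search(run_training):
    """run_training(lr, eps) -> final loss; supplied by the user (hypothetical)."""
    # Candidate values roughly in line with the suggestions in this thread.
    learning_rates = [1e-7, 1e-6, 1e-5]
    eps_values = [1e-3, 5e-3]

    results = {}
    for lr, eps in itertools.product(learning_rates, eps_values):
        loss = run_training(lr=lr, eps=eps)
        results[(lr, eps)] = loss
        print(f"lr={lr:.0e}, eps={eps:.0e} -> final loss {loss:.4f}")

    best_lr, best_eps = min(results, key=results.get)
    print(f"Best combination: lr={best_lr:.0e}, eps={best_eps:.0e}")
    return results
```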


dittops commented Jun 22, 2023

Same issue for me on LLaMA 7B: the loss was not reducing. I used LR=1e-6 and EPS=1e-3 with 8,600 steps.

gaotianyu1350 (Member) commented

Hi, not sure if there is any update, but I recently realized I gave a wrong hyperparameter in the README. For example, OPT-13B + SST-2 should use LR=1e-7 / EPS=1e-3. So I would suggest trying more hyperparameter tuning (especially of the LR).
