
Not convergent in custom dataset. #6

Open · jcao-ai opened this issue Jun 7, 2023 · 9 comments

Comments

jcao-ai commented Jun 7, 2023

Hi, glad to see the impressive project.

I overrode trainer.py according to the README and training runs properly. However, the model doesn't seem to converge: the loss stays around 0.7 no matter how many epochs it trains for. BTW, I use LLaMA, not OPT.

As a comparison, if I train on the same dataset with full fine-tuning, everything works fine and the loss drops below 0.1 almost immediately.

So is there some constraint that could cause training to fail?

[Screenshot attached: 2023-06-07 13:27]
gaotianyu1350 (Member) commented

Hi,

Thanks for your interest in our project! Can you specify the setting (task, model size, hyperparameters like steps, learning rate, eps, etc.)? Also, did you use our codebase and replace OPT with LLaMA, or did you copy the MeZO part into another codebase?

jcao-ai (Author) commented Jun 8, 2023

> Hi,
>
> Thanks for your interest in our project! Can you specify the setting (task, model size, hyperparameters like steps, learning rate, eps, etc.)? Also, did you use our codebase and replace OPT with LLaMA, or did you copy the MeZO part into another codebase?

Thanks for the reply. I integrated your trainer into my own codebase (LLaMA-7B). lr: 2e-5, steps: 3000, eps: 1e-3.

gaotianyu1350 (Member) commented

Hi,

I believe the learning rate is too large. I suggest starting from LR=1e-6, EPS=1e-3, and tuning the hyperparameters with grid search. Also, the number of steps needed for convergence depends on the task; I'd suggest trying at least 5,000 steps. All the OPT experiments in our paper used 20,000 steps.
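
For reference, here is a minimal sketch of a single MeZO step (my illustration, not the repository's exact trainer code), showing where LR and EPS enter the update; `compute_loss` is assumed to be a closure that returns a scalar loss under `torch.no_grad()`:

```python
import torch

def mezo_step(model, batch, compute_loss, lr=1e-6, eps=1e-3):
    # Draw a seed so the same perturbation z can be re-generated later
    # instead of being stored in memory.
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        torch.manual_seed(seed)
        for p in model.parameters():
            if p.requires_grad:
                z = torch.randn_like(p)
                p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1)                      # theta + eps * z
        loss_pos = compute_loss(model, batch)
        perturb(-2)                      # theta - eps * z
        loss_neg = compute_loss(model, batch)
        perturb(+1)                      # restore theta

        # Zeroth-order estimate of the directional derivative along z.
        projected_grad = (loss_pos - loss_neg) / (2 * eps)

        # SGD-style update: theta <- theta - lr * projected_grad * z,
        # re-generating the same z from the stored seed.
        torch.manual_seed(seed)
        for p in model.parameters():
            if p.requires_grad:
                z = torch.randn_like(p)
                p.data.add_(-lr * projected_grad * z)

    return loss_pos
```

Note how the update scale is lr * (loss difference) / (2 * eps): a too-large LR amplifies the already noisy zeroth-order estimate, which is why small learning rates and longer training are typically needed.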


nousr commented Jun 8, 2023

> I'd suggest trying at least 5,000 steps. All the OPT experiments in our paper used 20,000 steps.

@gaotianyu1350 how many steps were used in the paper when not using MeZO?

I'm running a similar experiment with LLaMA 7B and having trouble getting the model to converge (I can share results later today). I'm really curious, though, how many more steps were needed to fine-tune OPT.

thanks!

lramming commented

FYI, I have done some experiments on a custom dataset with OPT-125m and LoRA, and I observed the same problem of the loss not going down. I had to resort to really high learning rates, and for the first time I see the loss going down. It's probably too high now, since the loss doesn't go down smoothly but rather stepwise. I am using:

  • lr: 4e-2, cosine lr schedule
  • zo_eps: 5e-3,
  • batch size of 32
  • ~300k trainable parameters

The loss reported by DeepSpeed:
{'loss': 16.2535, 'learning_rate': 0.03999983126569921, 'epoch': 0.0}
{'loss': 15.5859, 'learning_rate': 0.039998481408375613, 'epoch': 0.0}
{'loss': 15.8855, 'learning_rate': 0.03999578178483493, 'epoch': 0.0}
{'loss': 15.832, 'learning_rate': 0.039991732577284014, 'epoch': 0.01}
{'loss': 16.0207, 'learning_rate': 0.0399863340590178, 'epoch': 0.01}
{'loss': 16.0691, 'learning_rate': 0.03997958659440085, 'epoch': 0.01}
{'loss': 15.9879, 'learning_rate': 0.03997149063884271, 'epoch': 0.01}
{'loss': 15.766, 'learning_rate': 0.03996204673876726, 'epoch': 0.01}
{'loss': 14.9773, 'learning_rate': 0.03995125553157573, 'epoch': 0.01}
{'loss': 7.8742, 'learning_rate': 0.03993911774560379, 'epoch': 0.01}
{'loss': 8.3316, 'learning_rate': 0.039925634200072314, 'epoch': 0.01}
{'loss': 7.8898, 'learning_rate': 0.039910805805032104, 'epoch': 0.02}
{'loss': 8.0785, 'learning_rate': 0.039894633561302496, 'epoch': 0.02}
{'loss': 8.0207, 'learning_rate': 0.039877118560403775, 'epoch': 0.02}
{'loss': 7.818, 'learning_rate': 0.03985826198448353, 'epoch': 0.02}
{'loss': 8.1402, 'learning_rate': 0.039838065106236845, 'epoch': 0.02}
{'loss': 7.884, 'learning_rate': 0.039816529288820436, 'epoch': 0.02}
{'loss': 7.8727, 'learning_rate': 0.039793655985760595, 'epoch': 0.02}
{'loss': 7.9, 'learning_rate': 0.039769446740855134, 'epoch': 0.02}
{'loss': 7.6941, 'learning_rate': 0.03974390318806917, 'epoch': 0.03}
{'loss': 7.8531, 'learning_rate': 0.039717027051424825, 'epoch': 0.03}
{'loss': 7.8488, 'learning_rate': 0.03968882014488491, 'epoch': 0.03}
{'loss': 7.8055, 'learning_rate': 0.03965928437223045, 'epoch': 0.03}
{'loss': 7.8672, 'learning_rate': 0.03962842172693222, 'epoch': 0.03}
{'loss': 7.9863, 'learning_rate': 0.03959623429201618, 'epoch': 0.03}
{'loss': 7.8461, 'learning_rate': 0.03956272423992289, 'epoch': 0.03}
{'loss': 7.9762, 'learning_rate': 0.03952789383236089, 'epoch': 0.04}

So it will take some time until it converges, but the loss is going down, which is nice.

gaotianyu1350 (Member) commented

@nousr You can refer to our Appendix D for steps used in each experiment.

@lramming Can you specify which dataset this is? Note two key points to make MeZO work: (1) always use prompts, and (2) allow a longer training time. All our OPT experiments use 20K steps, though you should expect to see significant performance improvement within 5K steps.
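
To illustrate point (1), here is a toy example of wrapping a classification example in a prompt so the task becomes next-token prediction (my illustration, not necessarily the paper's exact templates; those are listed in its appendix):

```python
# Toy prompt template for an SST-2-style sentiment task: the label is mapped
# to a verbalizer word and the model is trained to predict that word.
VERBALIZER = {0: "terrible", 1: "great"}

def build_prompt(sentence, label=None):
    prompt = f"{sentence} It was"
    if label is not None:  # training example includes the target word
        prompt += f" {VERBALIZER[label]}."
    return prompt

print(build_prompt("A moving and unforgettable film.", label=1))
# -> "A moving and unforgettable film. It was great."
```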

lramming commented

Unfortunately, I can't comment on the dataset. In the end, it unfortunately did not converge but remained at a loss of ~7, which is a lot higher than with normal optimisers. However, I am fairly certain that this is more a problem of choosing the right hyperparameters than an issue with the actual algorithm.
I did find that convergence is highly dependent on the choice of zo_eps and the learning rate; in some experiments, going a bit higher with zo_eps actually improved how the loss went down. I also tried normalising the gradient before updating the parameters; it improved convergence in some cases but ultimately failed: either the loss hovered around some high value without going down, or it hit a problem and went to 0.

I also suspect that the current implementation does not work well with DeepSpeed; you can switch off the DeepSpeed optimiser by removing the "optimizer" part of the DeepSpeed config and specifying "zero_force_ds_cpu_optimizer": false.
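
For what it's worth, a minimal sketch of that config change (only the "zero_force_ds_cpu_optimizer" flag and the missing "optimizer" section come from the comment above; the other fields are placeholders for whatever the rest of your config contains):

```python
import json

# DeepSpeed config with no "optimizer" section, so DeepSpeed does not build
# its own optimizer and MeZO's in-place parameter updates are left alone.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,      # placeholder value
    "gradient_accumulation_steps": 1,          # placeholder value
    "zero_force_ds_cpu_optimizer": False,
    # intentionally no "optimizer" block here
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```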

Doing a proper sweep across the hyperparameters that matter for the given dataset is probably a good idea; maybe I'll have some time to implement this in the future. A rough outline is sketched below.
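
Something along these lines would do, assuming a hypothetical run_training(lr, eps) callable that trains for a fixed number of steps and returns the final loss:

```python
import itertools

def grid_search(run_training):
    """run_training(lr, eps) -> final loss; supplied by the user (hypothetical)."""
    # Candidate values roughly in line with the suggestions in this thread.
    learning_rates = [1e-7, 1e-6, 1e-5]
    eps_values = [1e-3, 5e-3]

    results = {}
    for lr, eps in itertools.product(learning_rates, eps_values):
        loss = run_training(lr=lr, eps=eps)
        results[(lr, eps)] = loss
        print(f"lr={lr:.0e}, eps={eps:.0e} -> final loss {loss:.4f}")

    best_lr, best_eps = min(results, key=results.get)
    print(f"Best combination: lr={best_lr:.0e}, eps={best_eps:.0e}")
    return results
```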


dittops commented Jun 22, 2023

Same issue for me on LLaMA 7B: the loss was not reducing. I used LR=1e-6 and EPS=1e-3 with 8,600 steps.

gaotianyu1350 (Member) commented

Hi, not sure if there is any update, but I recently realized I gave a wrong hyperparameter in the README. For example, OPT-13B + SST-2 should use LR=1e-7 / EPS=1e-3. So I would suggest trying more hyperparameter tuning (especially of the LR).
