Not converging on a custom dataset #6
Comments
Hi, thanks for your interest in our project! Can you specify the setting (task, model size, hyperparameters such as steps, learning rate, eps, etc.)? Also, did you use our codebase and replace OPT with LLaMA, or did you copy the MeZO part into another codebase?
Thanks for the reply. I integrated your trainer into my own codebase (LLaMA-7B). lr: 2e-5, steps: 3000, eps: 1e-3.
Hi, I believe the learning rate is too large. I suggest starting from LR=1e-6, EPS=1e-3, and tuning the hyperparameters with a grid search. Also, the number of steps needed for convergence depends on the task; I'd suggest trying at least 5,000 steps. All the OPT experiments in our paper used 20,000 steps.
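A minimal sketch of the kind of grid search suggested above. The `run_mezo` helper is a placeholder for launching one MeZO fine-tuning run and returning its validation loss; the function name and the grids are illustrative assumptions, not part of the MeZO codebase.

```python
# Hypothetical grid search over MeZO hyperparameters.
# run_mezo is a placeholder, not a function from the MeZO repository.
from itertools import product

learning_rates = [1e-7, 1e-6, 1e-5]
epsilons = [1e-3, 1e-2]

def run_mezo(lr: float, eps: float, steps: int = 20_000) -> float:
    """Placeholder: launch one MeZO run and return the final validation loss."""
    return float("inf")  # replace with the real run's validation loss

best = None
for lr, eps in product(learning_rates, epsilons):
    val_loss = run_mezo(lr=lr, eps=eps)
    print(f"lr={lr:.0e} eps={eps:.0e} -> val loss {val_loss:.4f}")
    if best is None or val_loss < best[0]:
        best = (val_loss, lr, eps)

print(f"Best: val loss {best[0]:.4f} at lr={best[1]:.0e}, eps={best[2]:.0e}")
```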
@gaotianyu1350 How many steps were used in the paper when not using MeZO? I'm running a similar experiment with LLaMA-7B and having trouble getting the model to converge (I can share results later today). Really curious to know how many more steps were needed to fine-tune OPT. Thanks!
FYI, I have done some experiments on a custom dataset with OPT-125m and LoRA, and I observed the same problem of the loss not going down. I had to resort to really high learning rates, and for the first time I see the loss decreasing. It is probably too high now, since the loss is not going down smoothly but rather stepwise. I am using the loss reported by DeepSpeed, so it will take some time until it converges, but the loss is going down, which is nice.
@nousr You can refer to our Appendix D for the steps used in each experiment. @lramming Can you specify which dataset this is? Note the two key points to making MeZO work: (1) always using prompts, and (2) longer training time. All our OPT experiments use 20K steps, though you should expect to see significant performance improvement within 5K steps.
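To illustrate the first point, here is a sketch of prompt-based formatting for a binary sentiment task. The exact template wording below is an assumption for illustration; the MeZO repository defines its own templates per task.

```python
# Illustrative prompt template for an SST-2-style sentiment task.
# The wording is an assumption, not necessarily the template used in the repository.
def format_example(sentence, label=None):
    prompt = f"{sentence} It was"
    if label is not None:
        prompt += " great" if label == 1 else " terrible"
    return prompt

# The model is fine-tuned to predict the verbalizer token at the end of the prompt.
print(format_example("A charming and often affecting journey.", label=1))
# -> "A charming and often affecting journey. It was great"
```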
Unfortunately, I can't comment on the dataset. In the end it did not converge but remained at a loss of ~7, which is a lot higher than training with standard optimisers. However, I am fairly certain that this is more a problem of choosing the correct hyperparameters than an issue with the actual algorithm. I also suspect that the current implementation does not work well with DeepSpeed; you can switch off the DeepSpeed optimiser by removing the 'optimizer' section from the DeepSpeed config. A proper sweep across the hyperparameters that matter for the given dataset is probably a good idea; maybe I will have time to implement this in the future.
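For reference, a minimal sketch of a DeepSpeed config without an "optimizer" section, assuming DeepSpeed is configured through a dict passed to the Hugging Face `TrainingArguments`. The ZeRO stage, precision, and batch settings below are placeholders, not the settings used in this issue.

```python
# DeepSpeed config with no "optimizer" key, so DeepSpeed does not create its own
# optimizer and the trainer-side update is used instead. Values are placeholders.
from transformers import TrainingArguments

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},
    # Note: no "optimizer" section here.
}

args = TrainingArguments(
    output_dir="out",
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
```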
Same issue for me on LLaMA-7B: the loss was not decreasing. I used LR=1e-6 and EPS=1e-3 with 8,600 steps.
Hi, not sure if there is any update, but I recently realized I gave a wrong hyperparameter in the README. For example, OPT-13B + SST-2 should use LR=1e-7 / EPS=1e-3. So I would suggest more hyperparameter tuning (especially the LR).
Hi, glad to see this impressive project.
I overloaded trainer.py according to the README and training runs properly. However, the model doesn't seem to converge: the loss stays around 0.7 no matter how many epochs it trains for. Note that I use LLaMA rather than OPT. As a comparison, if I train on the same dataset with full fine-tuning, everything works fine and the loss drops below 0.1 almost immediately.
So is there some constraint that may lead to the failure of training?
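For context on what the overloaded trainer step does, here is a minimal sketch of MeZO's zeroth-order update as described in the paper: two forward passes with symmetric parameter perturbations, a gradient estimate from the loss difference, then an SGD step along the same random direction. This is a simplified illustration, not the repository's implementation; `loss_fn` is a placeholder that runs a forward pass on the batch and returns the scalar loss.

```python
import torch

def mezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One simplified MeZO update. The real implementation resamples z from a
    saved random seed instead of storing it, keeping memory at inference level."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Random perturbation direction z (stored explicitly here for clarity).
    zs = [torch.randn_like(p) for p in params]

    with torch.no_grad():
        # theta + eps * z
        for p, z in zip(params, zs):
            p.add_(z, alpha=eps)
        loss_plus = loss_fn(model, batch)

        # theta - eps * z (move by -2 * eps * z)
        for p, z in zip(params, zs):
            p.sub_(z, alpha=2 * eps)
        loss_minus = loss_fn(model, batch)

        # Restore the original parameters.
        for p, z in zip(params, zs):
            p.add_(z, alpha=eps)

        # Projected gradient estimate and SGD step along z.
        grad_est = (loss_plus.item() - loss_minus.item()) / (2 * eps)
        for p, z in zip(params, zs):
            p.sub_(z, alpha=lr * grad_est)

    return loss_plus
```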