Which trainer to use #10
Comments
Issue above solved. I figured that you have …

I'm using MeZO to try and fully finetune a model. Usually the batch size used for my task is …
Hi, glad that you made it work! For gradient accumulation, we did not implement it because in our preliminary experiments we found that, instead of doing gradient accumulation, spending the compute on more steps is more beneficial. But if you want to implement it, it can easily be done by storing the random seed for each step. As for the learning rate schedule, we did some preliminary exploration in Appendix A.2 but did not find it beneficial.
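For readers who want to try it, here is a minimal sketch of what seed-based gradient accumulation could look like on top of MeZO's two-forward-pass (SPSA-style) update. The helper names, the `loss_fn(model, batch)` callable (assumed to return a scalar loss tensor), and the plain SGD-style update are illustrative assumptions, not the repository's actual trainer code.

```python
import torch

def zo_perturb(model, seed, eps, scale):
    # Re-create the same Gaussian noise z from `seed` and shift every
    # parameter by scale * eps * z (scale = +1 / -2 / +1 walks the
    # parameters to +eps, then -eps, then back to the original values).
    torch.manual_seed(seed)
    for p in model.parameters():
        z = torch.normal(mean=0.0, std=1.0, size=p.shape,
                         device=p.device, dtype=p.dtype)
        p.data.add_(scale * eps * z)

def zo_projected_grad(model, loss_fn, batch, seed, eps):
    # Zeroth-order estimate: two forward passes, no backpropagation.
    with torch.no_grad():
        zo_perturb(model, seed, eps, +1.0)
        loss_pos = loss_fn(model, batch)
        zo_perturb(model, seed, eps, -2.0)
        loss_neg = loss_fn(model, batch)
        zo_perturb(model, seed, eps, +1.0)  # restore original parameters
    return (loss_pos - loss_neg).item() / (2.0 * eps)

def accumulated_zo_update(model, loss_fn, micro_batches, lr, eps):
    # "Gradient accumulation" for MeZO: record (seed, projected_grad) for
    # each micro-batch, then replay each seed once to apply the averaged
    # update, so no per-parameter gradients ever need to be stored.
    records = []
    for batch in micro_batches:
        seed = int(torch.randint(0, 2**31 - 1, (1,)).item())
        records.append((seed, zo_projected_grad(model, loss_fn, batch, seed, eps)))
    with torch.no_grad():
        for seed, proj_grad in records:
            torch.manual_seed(seed)
            for p in model.parameters():
                z = torch.normal(mean=0.0, std=1.0, size=p.shape,
                                 device=p.device, dtype=p.dtype)
                p.data.add_(-lr * (proj_grad / len(records)) * z)
```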
Sorry, I forgot to say: you can ignore the AdamW warning, as we are not using it in MeZO.
Hi @gaotianyu1350, thank you very much for your prompt reply! I can use a larger batch size with FSDP. Training is going fine, but I'm encountering this error when I evaluate on the validation set or save a checkpoint: …

Just wanted to check whether you have tried using MeZO with FSDP, so I can narrow down the problem. I'm finetuning LLaMA-13B. I added the decorator … Your input is very much appreciated.
Hi, I haven't tried using FSDP with MeZO because I believe it slows down the training a bit. If you want to use more GPUs, I think you can just use the original MeZO script, which supports multi-GPU (it uses Hugging Face's accelerate for multi-GPU inference). You can check out this line: https://github.com/princeton-nlp/MeZO/blob/9a51fbf46849d72c85416150b017b954afb91357/large_models/run.py#LL166C1-L167C1

To enable larger batches with more GPUs, change the "5" there to a larger number (it tells the automatic device mapping how much memory to set aside on each GPU for non-parameter memory use).
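For context, a rough sketch of the kind of multi-GPU loading that comment refers to: the model is sharded across GPUs with `device_map="auto"`, and a fixed number of GB per GPU is held back for non-parameter memory. The variable names, checkpoint, and headroom value below are placeholders; see the linked line in `run.py` for the actual code.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical headroom: GB reserved per GPU for non-parameter memory
# (activations, cache, etc.). A larger batch size may need a larger value.
HEADROOM_GB = 5

free_in_gb = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = {
    i: f"{free_in_gb - HEADROOM_GB}GB" for i in range(torch.cuda.device_count())
}

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",   # placeholder checkpoint
    device_map="auto",        # shard layers across the available GPUs
    torch_dtype=torch.float16,
    max_memory=max_memory,    # cap parameter memory per GPU, leaving headroom
)
```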
Hi @gaotianyu1350, thank you for your response. I tried finetuning a 7B model with MeZO on 1 GPU, and saving and evaluating worked fine, so I'm guessing there must be an issue integrating MeZO with FSDP. I'm hoping to use MeZO with models that do not fit on 1 GPU, so I have to use FSDP. I'll dig further.

I'm also having issues using MeZO to finetune LLaMA-13B on conversation data: the loss does not decrease. I tried decreasing and increasing the learning rate, and I also tried schedulers, to no avail. Do you think it might be related to the context size I'm using? Or maybe the MeZO hyperparameters?
Hi, because MeZO does not require backpropagation, you do not need FSDP for multi-GPU training: MeZO uses Hugging Face's accelerate package to enable multi-GPU inference (because MeZO only needs "inference" mode).

As for the loss problem, I suggest starting from PEFT methods (prompt tuning or LoRA) and tuning the hyperparameters (you can start from the ones suggested in our appendix). PEFT methods are usually easier to tune.
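As a concrete starting point for the LoRA suggestion above, here is a minimal sketch using Hugging Face's `peft` library. This is a generic illustration rather than the repository's own LoRA implementation, and the checkpoint name, target modules, and rank are assumptions to adjust.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable
```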
Hello,
Thank you for sharing your work!
I noticed that `finetune.sh` and `finetune_fsdp.sh` use `regular` training by default. Should I change it to `zo` to enable the MeZO trainer?

Also, I'm getting the error `AttributeError: 'LlamaForCausalLM' object has no attribute 'module'` when I try to finetune LLaMA on one GPU using `regular` as the trainer. The error stems from this line: …

Any help would be very much appreciated.
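On the `AttributeError`: accessing `model.module` only works when the model is wrapped (e.g., by DistributedDataParallel or FSDP); a plain `LlamaForCausalLM` has no `.module` attribute. A common defensive pattern, shown here as a general sketch rather than this repository's actual fix, is to unwrap only when the attribute exists:

```python
from transformers import AutoModelForCausalLM

def unwrap_model(model):
    # DDP/FSDP wrappers expose the underlying model as `.module`;
    # an unwrapped single-GPU model does not have that attribute.
    return model.module if hasattr(model, "module") else model

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # placeholder
base = unwrap_model(model)  # works whether or not the model is wrapped
```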