Which trainer to use #10

Open
HaniItani opened this issue Jun 21, 2023 · 7 comments

Comments


HaniItani commented Jun 21, 2023

Hello,

Thank you for sharing your work!

I noticed that finetune.sh and finetune_fsdp.sh use regular training by default. Should I change it to zo to enable the MeZO trainer? Also, I'm getting the error AttributeError: 'LlamaForCausalLM' object has no attribute 'module' when I try to finetune LLaMA on one GPU with regular as the trainer. The error stems from this line:

with model.no_sync():
    tr_loss_step = self.training_step(model, inputs)
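
(If it helps to narrow things down: no_sync() and .module normally exist only on the DistributedDataParallel/FSDP wrapper, not on a bare LlamaForCausalLM, so a single-GPU, unwrapped run hits an AttributeError on this path. The following is only a sketch of the kind of guard I mean, not a proposed patch:)

# Sketch only: fall back to a no-op context when the model is not wrapped
# in a distributed wrapper that provides no_sync().
import contextlib

sync_ctx = model.no_sync() if hasattr(model, "no_sync") else contextlib.nullcontext()
with sync_ctx:
    tr_loss_step = self.training_step(model, inputs)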

Any help would be very much appreciated.


HaniItani commented Jun 21, 2023

Issue above solved. I figured out that finetune.sh is for reproducing the experiments in the paper and that I should actually use mezo.sh to train with MeZO. May I ask why gradient accumulation is not supported for MeZO? Did you experiment with learning rate schedulers? I'm also getting a warning that AdamW is being used in the MeZO trainer loop:

FutureWarning: This implementation of  AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

I'm using MeZO to try to fully finetune a model. The batch size usually used for my task is per_device_batch_size x gradient_accumulation_steps x num_gpus = 128. I think limiting max_tokens is what enabled you to train with per_device_batch_size=16, but that seems limiting for tasks such as finetuning chat models, for example. Do you have any suggestions on how I can train with larger batch sizes and higher max_new_tokens?

@gaotianyu1350
Member

Hi,

Glad that you made it work! For gradient accumulation, we did not implement it because in our preliminary experiments we found that, instead of doing gradient accumulation, spending the same compute on more steps is more beneficial. But if you want to implement it, it can easily be done by storing the random seed for each accumulation step. For learning rate schedules, we did some preliminary exploration in Appendix A.2 but did not find them beneficial.
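
Roughly, such an accumulated step could look like the following (a sketch with assumed helper names and hyperparameters, not code from this repo; loss_fn is assumed to return a scalar loss tensor from a forward pass):

# Hedged sketch of MeZO gradient accumulation via stored seeds (illustration only).
import torch

def zo_perturb(params, seed, eps, scale):
    # Regenerate the same z from `seed` and move the parameters by scale * eps * z.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.data.add_(z, alpha=scale * eps)

@torch.no_grad()
def zo_step_accumulated(model, micro_batches, loss_fn, eps=1e-3, lr=1e-6):
    params = [p for p in model.parameters() if p.requires_grad]
    records = []  # one (seed, projected_grad) pair per micro-batch

    for batch in micro_batches:
        seed = torch.randint(0, 2**31 - 1, (1,)).item()
        zo_perturb(params, seed, eps, +1)        # theta + eps * z
        loss_pos = loss_fn(model, batch)         # forward pass only
        zo_perturb(params, seed, eps, -2)        # theta - eps * z
        loss_neg = loss_fn(model, batch)
        zo_perturb(params, seed, eps, +1)        # restore theta
        records.append((seed, (loss_pos - loss_neg).item() / (2 * eps)))

    # One update from the averaged projected gradients; each z is regenerated
    # from its stored seed, so only seeds and scalars are kept in memory.
    for seed, g in records:
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(z, alpha=-lr * g / len(records))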

@gaotianyu1350
Member

Sorry, I forgot to say: you can ignore the AdamW warning, as we are not actually using AdamW in MeZO.


HaniItani commented Jun 21, 2023

Hi @gaotianyu1350,

Thank you very much for your prompt reply! I can use a larger batch size with FSDP. Training is going fine, but I'm encountering this error when I evaluate on the validation set or save a checkpoint.

RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before
doing inplace update.See https://github.com/pytorch/rfcs/pull/17 for more details.

Just wanted to check whether you have tried using MeZO with FSDP, so I can narrow down the problem. I'm finetuning LLaMA-13B. I added the @torch.inference_mode() decorator to prediction_step and _save_checkpoint in trainer.py and it seems to work, but I doubt this is the right way of dealing with it.
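
Concretely, the workaround amounts to something like this (sketched as a Trainer subclass instead of editing trainer.py in place; treat it as a band-aid, not a vetted fix for the FSDP issue):

# Sketch of the workaround described above.
import torch
from transformers import Trainer

class InferenceModeTrainer(Trainer):
    @torch.inference_mode()
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Run evaluation entirely inside inference mode.
        return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)

    @torch.inference_mode()
    def _save_checkpoint(self, model, trial, metrics=None):
        # Same for checkpoint saving.
        return super()._save_checkpoint(model, trial, metrics=metrics)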

Your input is very much appreciated.

@gaotianyu1350
Member

Hi,

I haven't tried using FSDP with MeZO because I believe it slows down training a bit. If you want to use more GPUs, I think you can just use the original mezo script, which supports multi-GPU: it uses the accelerate package to automatically split the model across GPUs.

You can check out this line https://github.com/princeton-nlp/MeZO/blob/9a51fbf46849d72c85416150b017b954afb91357/large_models/run.py#LL166C1-L167C1

To enable larger batches with more GPUs, change the "5" here to a larger number (it tells the automatic device mapping how much memory to reserve on each GPU for non-parameter memory use).
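
For illustration, the pattern around that line looks roughly like this (paraphrased rather than copied, and the checkpoint name is only a placeholder); the key knob is how many GB are reserved per GPU:

# accelerate's device_map="auto" splits the model across GPUs; the reserved
# amount (the "5") is headroom for activations and other non-parameter memory.
# Raising it leaves room for larger batches / longer generations.
import torch
from transformers import AutoModelForCausalLM

free_in_gb = int(torch.cuda.mem_get_info()[0] / 1024**3)
reserve_gb = 5  # increase this (e.g. 10-20) to leave more per-GPU headroom

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",          # placeholder checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={i: f"{free_in_gb - reserve_gb}GB"
                for i in range(torch.cuda.device_count())},
)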


HaniItani commented Jun 22, 2023

Hi @gaotianyu1350,

Thank you for your response. I tried finetuning a 7B model with MeZO on one GPU and saving and evaluating worked fine, so I'm guessing there is an issue with integrating MeZO and FSDP. I'm hoping to use MeZO with models that do not fit on one GPU, so I have to use FSDP. I'll dig further.

I'm also having issues using MeZO to finetune LLaMA-13B on conversation data: the loss does not decrease. I tried decreasing and increasing the learning rate, and I also tried schedulers, to no avail. Do you think it might be related to the context size I'm using? Or maybe the MeZO hyperparameters?

@gaotianyu1350
Member

Hi,

Because MeZO does not require backpropagation, you do not need FSDP for multi-GPU. MeZO uses Hugging Face's accelerate package to enable multi-GPU inference (because MeZO only needs "inference" mode).

As for the loss problem, I suggest starting with PEFT methods (prompt tuning or LoRA) and tuning the hyperparameters (you can start from the ones suggested in our appendix); PEFT methods are usually easier to tune.
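
For illustration only, a generic LoRA setup with the peft library looks like this (placeholder hyperparameters and checkpoint name; for MeZO itself, use the repo's own LoRA/prefix-tuning options and the settings suggested in the appendix):

# Generic LoRA sketch with peft; only the small adapter matrices are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-13b", device_map="auto")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"],  # common choice for LLaMA attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only adapters are trainable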
