Which trainer to use #10

Open
HaniItani opened this issue Jun 21, 2023 · 7 comments

Comments


HaniItani commented Jun 21, 2023

Hello,

Thank you for sharing your work!

I noticed that finetune.sh and finetune_fsdp.sh use regular training by default. Should I change it to zo to enable the MeZO trainer? Also, I'm getting the error AttributeError: 'LlamaForCausalLM' object has no attribute 'module' when I try to finetune LLaMA on one GPU with regular as the trainer. The error stems from this line:

with model.no_sync():
    tr_loss_step = self.training_step(model, inputs)
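
(If it helps to narrow things down: no_sync() and .module normally exist only on the DistributedDataParallel/FSDP wrapper, not on a bare LlamaForCausalLM, so a single-GPU, unwrapped run hits an AttributeError on this path. The following is only a sketch of the kind of guard I mean, not a proposed patch:)

# Sketch only: fall back to a no-op context when the model is not wrapped
# in a distributed wrapper that provides no_sync().
import contextlib

sync_ctx = model.no_sync() if hasattr(model, "no_sync") else contextlib.nullcontext()
with sync_ctx:
    tr_loss_step = self.training_step(model, inputs)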

Any help would be very much appreciated.


HaniItani commented Jun 21, 2023

Issue above solved. I figured out that finetune.sh is for reproducing the experiments in the paper and that I should actually use mezo.sh to train with MeZO. May I ask why gradient accumulation is not supported for MeZO? Did you experiment with learning rate schedulers? I'm also getting a warning that AdamW is being used in the MeZO trainer loop:

FutureWarning: This implementation of  AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

I'm using MeZO to try to fully finetune a model. The batch size usually used for my task is per_device_batch_size x gradient_accumulation_steps x num_gpus = 128. I think limiting max_tokens is what enabled you to train with per_device_batch_size=16, but that seems limiting for tasks such as finetuning chat models, for example. Do you have any suggestions on how I can train with larger batch sizes and higher max_new_tokens?

@gaotianyu1350
Member

Hi,

Glad that you made it work! For gradient accumulation, we did not implement it because in our preliminary experiments we found that, instead of doing gradient accumulation, spending the same compute on more steps is more beneficial. But if you want to implement it, it can easily be done by storing the random seed for each accumulation step. For learning rate schedules, we did some preliminary exploration in Appendix A.2 but did not find them beneficial.
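
Roughly, such an accumulated step could look like the following (a sketch with assumed helper names and hyperparameters, not code from this repo; loss_fn is assumed to return a scalar loss tensor from a forward pass):

# Hedged sketch of MeZO gradient accumulation via stored seeds (illustration only).
import torch

def zo_perturb(params, seed, eps, scale):
    # Regenerate the same z from `seed` and move the parameters by scale * eps * z.
    torch.manual_seed(seed)
    for p in params:
        z = torch.randn_like(p)
        p.data.add_(z, alpha=scale * eps)

@torch.no_grad()
def zo_step_accumulated(model, micro_batches, loss_fn, eps=1e-3, lr=1e-6):
    params = [p for p in model.parameters() if p.requires_grad]
    records = []  # one (seed, projected_grad) pair per micro-batch

    for batch in micro_batches:
        seed = torch.randint(0, 2**31 - 1, (1,)).item()
        zo_perturb(params, seed, eps, +1)        # theta + eps * z
        loss_pos = loss_fn(model, batch)         # forward pass only
        zo_perturb(params, seed, eps, -2)        # theta - eps * z
        loss_neg = loss_fn(model, batch)
        zo_perturb(params, seed, eps, +1)        # restore theta
        records.append((seed, (loss_pos - loss_neg).item() / (2 * eps)))

    # One update from the averaged projected gradients; each z is regenerated
    # from its stored seed, so only seeds and scalars are kept in memory.
    for seed, g in records:
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(z, alpha=-lr * g / len(records))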

@gaotianyu1350
Member

Sorry, I forgot to say: you can ignore the AdamW warning, as we are not actually using AdamW in MeZO.


HaniItani commented Jun 21, 2023

Hi @gaotianyu1350,

Thank you very much for your prompt reply! I can use a larger batch size with FSDP. Training is going fine, but I'm encountering this error when I evaluate on the validation set or save a checkpoint.

RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before
doing inplace update.See https://github.com/pytorch/rfcs/pull/17 for more details.

Just wanted to check whether you have tried using MeZO with FSDP, so I can narrow down the problem. I'm finetuning LLaMA-13B. I added the @torch.inference_mode() decorator to prediction_step and _save_checkpoint in trainer.py and it seems to work, but I doubt this is the right way of dealing with it.
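
Concretely, the workaround amounts to something like this (sketched as a Trainer subclass instead of editing trainer.py in place; treat it as a band-aid, not a vetted fix for the FSDP issue):

# Sketch of the workaround described above.
import torch
from transformers import Trainer

class InferenceModeTrainer(Trainer):
    @torch.inference_mode()
    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Run evaluation entirely inside inference mode.
        return super().prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)

    @torch.inference_mode()
    def _save_checkpoint(self, model, trial, metrics=None):
        # Same for checkpoint saving.
        return super()._save_checkpoint(model, trial, metrics=metrics)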

Your input is very much appreciated.

@gaotianyu1350
Member

Hi,

I haven't tried using FSDP with MeZO because I believe it slows down training a bit. If you want to use more GPUs, I think you can just use the original mezo script, which supports multi-GPU: it uses the accelerate package to automatically split the model across GPUs.

You can check out this line https://github.com/princeton-nlp/MeZO/blob/9a51fbf46849d72c85416150b017b954afb91357/large_models/run.py#LL166C1-L167C1

To enable larger batches with more GPUs, change the "5" here to a larger number (it tells the automatic device mapping how much memory to reserve on each GPU for non-parameter memory use).
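
For illustration, the pattern around that line looks roughly like this (paraphrased rather than copied, and the checkpoint name is only a placeholder); the key knob is how many GB are reserved per GPU:

# accelerate's device_map="auto" splits the model across GPUs; the reserved
# amount (the "5") is headroom for activations and other non-parameter memory.
# Raising it leaves room for larger batches / longer generations.
import torch
from transformers import AutoModelForCausalLM

free_in_gb = int(torch.cuda.mem_get_info()[0] / 1024**3)
reserve_gb = 5  # increase this (e.g. 10-20) to leave more per-GPU headroom

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",          # placeholder checkpoint
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={i: f"{free_in_gb - reserve_gb}GB"
                for i in range(torch.cuda.device_count())},
)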


HaniItani commented Jun 22, 2023

Hi @gaotianyu1350,

Thank you for your response. I tried finetuning a 7B model with MeZO on one GPU and saving and evaluating worked fine, so I'm guessing there is an issue with integrating MeZO and FSDP. I'm hoping to use MeZO with models that do not fit on one GPU, so I have to use FSDP. I'll dig further.

I'm also having issues using MeZO to finetune LLaMA-13B on conversation data: the loss does not decrease. I tried decreasing and increasing the learning rate, and I also tried schedulers, to no avail. Do you think it might be related to the context size I'm using? Or maybe the MeZO hyperparameters?

@gaotianyu1350
Member

Hi,

Because MeZO does not require backpropagation, you do not need FSDP for multi-GPU. MeZO uses Hugging Face's accelerate package to enable multi-GPU inference (because MeZO only needs "inference" mode).

As for the loss problem, I suggest starting with PEFT methods (prompt tuning or LoRA) and tuning the hyperparameters (you can start from the ones suggested in our appendix); PEFT methods are usually easier to tune.
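
For illustration only, a generic LoRA setup with the peft library looks like this (placeholder hyperparameters and checkpoint name; for MeZO itself, use the repo's own LoRA/prefix-tuning options and the settings suggested in the appendix):

# Generic LoRA sketch with peft; only the small adapter matrices are trained.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-13b", device_map="auto")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "v_proj"],  # common choice for LLaMA attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only adapters are trainable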
