Validation step uses 2x memory and 12x compute time #558
-
The validation stage of running an experiment consistently uses about 2x the memory of the training stage, and experiments that fail during validation are treated as failures and (in the UI at least) cannot be used for inference. This restriction is very limiting, since it means I can only train with half my available memory, leaving the rest for validation. (Might this be related to the default settings, since 512 tokens as the max length is 2x 256, the max number of response tokens?)

Additionally, validation takes an extraordinary amount of time, almost 12x longer than training! In my configuration I have validation size = 0.01 and data sample = 0.3, but for training the logs say […]

Is there a way for me to skip the validation step of the experiment? I just want to finetune a model via LoRA.
-
I was using the BLEU metric, and switching to Perplexity resulted in greatly reduced memory usage (comparable to training memory usage) and reduced compute time (it still takes a long time, but not as bad as BLEU).
-
You can sample the validation set to reduce the time further, or create a custom validation dataset with only very few samples (a sketch of one way to do this follows below).
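For example, a small custom validation split can be carved out of an existing dataset before uploading it. This is a minimal sketch using pandas; the file names and the 50-row sample size are arbitrary assumptions for illustration, not settings of the tool itself:

```python
import pandas as pd

# Load the full dataset (hypothetical file name).
df = pd.read_csv("train_full.csv")

# Keep a tiny, reproducible validation split; 50 rows is an
# arbitrary choice -- just enough to track validation quality.
val_df = df.sample(n=50, random_state=42)
train_df = df.drop(val_df.index)

train_df.to_csv("train.csv", index=False)
val_df.to_csv("validation.csv", index=False)
```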
It is expected that the validation metrics that rely on generating new output (BLEU and GPT) are slower than training; Perplexity should run at about the same speed as your training, as the sketch below illustrates.
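The reason is that perplexity can be computed from the same single forward pass that produces the training loss, whereas generation-based metrics must decode autoregressively, running one forward pass per new token and holding the generated continuation in memory. A rough sketch with Hugging Face transformers (the model name and `max_new_tokens=256` are placeholder assumptions, not any tool's defaults):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("An example validation prompt", return_tensors="pt")

# Perplexity: one forward pass, same cost profile as a training step.
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
perplexity = torch.exp(loss)

# Generation-based metrics (BLEU, GPT): up to max_new_tokens sequential
# forward passes per sample, plus memory for the generated tokens --
# hence the extra validation time and memory.
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=256)
```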