CUDA out of memory error on training #21
Comments
Hi Aftab,

Thanks for submitting this issue. It took a while to debug; the main thing I have found so far is that I can run this just fine on 4 GPUs (with 48GB of memory, instead of 24GB), but not on 2 or 1. And to be clear, the batch size scales with the number of GPUs, so when I run on 4 I am also running with 4x the data compared to a single GPU. I have never seen this kind of behavior before; normally, when you scale the batch size by the number of GPUs like that, you can run just fine on any number of GPUs.

My main guess is that something about the model or optimizer state is getting distributed across the GPUs, so that when there are more GPUs, the memory cost per GPU is lower. I have tried disabling every setting I could think of in the config, but could not make this behavior go away, so I am not sure this is something I can remedy. It may be worth posting to the original GPT-NeoX repository directly. My only concern is that this could somehow be a Docker issue, in which case it isn't on their end either; you may want to try this in a virtualenv instead, to test it without Docker.

I will see what else I can try, but for now this is the state of affairs: training the largest model requires an excessive amount of memory on a single GPU, presumably because the optimizer states alone take up more than 48GB.

-Vincent
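For a rough sense of why optimizer and model state could dominate here, a back-of-envelope estimate for a 2.7B-parameter model trained with Adam under mixed precision is sketched below. The bytes-per-parameter figures are the standard rule of thumb (as in the ZeRO paper), not measurements from this repository:

    # Rough training-state memory estimate for 2.7B parameters with mixed-precision Adam.
    # Byte counts per parameter are the usual rule of thumb, not repo-specific numbers.
    params = 2.7e9
    fp16_weights  = 2 * params   # model weights kept in fp16
    fp16_grads    = 2 * params   # gradients in fp16
    fp32_master   = 4 * params   # fp32 master copy of the weights
    adam_momentum = 4 * params   # Adam first-moment state, fp32
    adam_variance = 4 * params   # Adam second-moment state, fp32

    total = fp16_weights + fp16_grads + fp32_master + adam_momentum + adam_variance
    print(f"{total / 2**30:.0f} GiB before activations")  # ~40 GiB

    # ZeRO-style partitioning shards this state across ranks, so the per-GPU cost
    # drops roughly as 1/N GPUs, which would be consistent with 4 GPUs fitting
    # while 1 or 2 do not.

Together with activations and temporary buffers, an estimate in this range is comfortably beyond a single 24GB card, which fits the behavior described above.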
Hi Vincent,

Thank you so much for your time. It is indeed very interesting that PyTorch always reserves such a huge chunk of memory, leaving less free space than the amount the script tries to allocate. I saw changing the …

My Best Regards,
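The "reserved in total by PyTorch" figure in that error comes from PyTorch's caching allocator, which holds on to freed blocks instead of returning them to the driver. A minimal way to compare memory actually held by live tensors with memory merely cached by the allocator, using plain torch.cuda calls (nothing specific to this repo):

    import torch

    dev = torch.device("cuda:0")
    x = torch.empty(1024, 1024, 512, device=dev)  # ~2 GiB of fp32, just to exercise the allocator

    print(f"allocated: {torch.cuda.memory_allocated(dev) / 2**30:.2f} GiB")  # held by live tensors
    print(f"reserved:  {torch.cuda.memory_reserved(dev) / 2**30:.2f} GiB")   # cached by the allocator
    print(torch.cuda.memory_summary(dev, abbreviated=True))                  # detailed breakdown

nvidia-smi reports the whole process footprint, so a large reserved-but-unallocated pool can make the GPU look busier (or freer) than the allocator's own view suggests.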
I was trying to train PolyCoder using the preconfigured dataset, starting from the checkpoint checkpoints-2-7B. I used the following command, as per the instructions in the repo (only changing the configs as appropriate):

    sudo python ./deepy.py train.py -d configs 2-7B.yml local_setup.yml

which gave the following error:

    RuntimeError: CUDA out of memory. Tried to allocate 1.86 GiB (GPU 0; 23.70 GiB total capacity; 20.49 GiB already allocated; 1.74 GiB free; 20.50 GiB reserved in total by PyTorch)

Interestingly, the full ~25 GB of our GPU is free, as per nvidia-smi.

I tried reducing the batch size. The only batch-size setting I found in the config files was train_micro_batch_size_per_gpu: 8 in 2-7B.yml. It was 8; I changed it to 4 and then to 1, but got the same error in both cases.

I am running all of this in Docker, as per the containerized setup instructions.

Appreciate any help!
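For context on how that key interacts with the rest of the setup: in DeepSpeed-style configs the global batch size is the product of the micro batch size per GPU, the gradient accumulation steps, and the number of GPUs. A small sketch follows; the gradient_accumulation_steps value is assumed for illustration, not taken from 2-7B.yml:

    # Effective (global) batch size in a DeepSpeed / GPT-NeoX style setup.
    # Variable names mirror the config keys; values are illustrative only.
    train_micro_batch_size_per_gpu = 8  # the key changed in this issue
    gradient_accumulation_steps = 4     # assumed value, not from 2-7B.yml
    num_gpus = 1

    train_batch_size = (train_micro_batch_size_per_gpu
                        * gradient_accumulation_steps
                        * num_gpus)
    print(train_batch_size)  # 32

    # Lowering train_micro_batch_size_per_gpu only shrinks activation memory per step;
    # it does not touch the weights, gradients, or optimizer states, which would
    # explain why dropping it from 8 to 1 made no difference to the OOM above.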