Add a Jupyter file for trying the largest batch #23
base: main
Conversation
I wonder if this problem occurs during the training of nearly all models on Argoverse 2. Does it indicate that training a model on Argoverse 2 requires a lot of computational power? Has anyone tried reducing the batch size and the learning rate? Can we reproduce the experimental results reported in the paper?
Hi @HLSS-Hen, thanks for contributing to this repo! I'm sorry for not getting back to you sooner. Could you please add a section to the README.md briefly illustrating the usage of this script, so that people who come across the OOM issue know how to use it? Thanks!
@ZikangZhou, I'm not good at writing, so feel free to revise it. Treat the License cell as the 0-th Jupyter cell. Users need to correctly fill in the CUDA device to use. Execute all cells in sequence, and the code will automatically build a train-step input from the `BATCH_SIZE` largest samples (ordered by sample file size), completing a full single forward and backward pass of the train step.
If all cells finish without any errors, the current batch size can be considered usable. If you are not comfortable with Jupyter, you can directly copy the code of each cell into a new Python script and run that instead.
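For readers who just want the idea: the core of the notebook amounts to something like the sketch below. This is not the actual cells; the assumption that processed samples are `torch.save`-d PyG data objects under some `data_dir`, and that the model returns a scalar loss, are placeholders you would adapt to this repo.

```python
# Sketch: stress-test one training step with the worst-case (largest) batch.
from pathlib import Path

import torch
from torch_geometric.data import Batch

BATCH_SIZE = 32                      # adjust manually to probe your GPU's limit
DEVICE = torch.device('cuda:0')      # fill in the CUDA device you actually use


def run_largest_batch(model, data_dir: str, batch_size: int = BATCH_SIZE):
    # Pick the `batch_size` largest processed files, i.e. the worst-case batch.
    files = sorted(Path(data_dir).glob('*.pt'),
                   key=lambda p: p.stat().st_size, reverse=True)[:batch_size]
    data_list = [torch.load(f) for f in files]
    batch = Batch.from_data_list(data_list).to(DEVICE)

    model = model.to(DEVICE)
    model.train()
    loss = model(batch)              # assumed to return a scalar training loss
    loss.backward()                  # full forward + backward of one train step
    peak = torch.cuda.max_memory_allocated(DEVICE) / 2**30
    print(f'peak GPU memory: {peak:.2f} GiB')
```

If this runs without an OOM error, a random batch of the same size should also fit, since any other batch is no larger than this one.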
@HLSS-Hen
@SunHaoOne, for those unfamiliar with
#2, the VRAM leak issue.
This is not a memory leak, but rather a result of a significant disparity in the size of input samples. Let's take the processed training data file size as an example. There are only 110 samples larger than 500KB and fewer than 20,000 samples larger than 300KB (considering there are nearly 200,000 samples in total).
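For reference, these counts can be reproduced with a few lines; this is a sketch, and the `data/train/processed` path and `*.pt` suffix are placeholders for wherever and however your processed files are stored:

```python
from pathlib import Path

sizes = [p.stat().st_size for p in Path('data/train/processed').glob('*.pt')]
print(len(sizes), 'samples in total')
print(sum(s > 500 * 1024 for s in sizes), 'samples larger than 500 KB')
print(sum(s > 300 * 1024 for s in sizes), 'samples larger than 300 KB')
```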
During training, batches are randomly composed (batch_size << number of samples). In extreme cases, a batch made up of the largest samples can easily trigger an OOM error. The seemingly leaking memory is actually caused by the occasional generation of larger batches: if you call `torch.cuda.empty_cache()`, you will see that the actual memory usage fluctuates. I've written this Jupyter script to help you confirm whether your GPU can handle an extremely large batch. By manually adjusting `BATCH_SIZE`, you can determine an appropriate size for training. However, I cannot guarantee how much headroom should be reserved to be safe, as someone reported an OOM error with a batch size of 2 on a 24GB GPU, while this script was able to run successfully.
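To see for yourself that the usage fluctuates rather than grows monotonically, you can log allocated and reserved memory around `torch.cuda.empty_cache()` during training. A minimal sketch (where you hook it into your training loop is up to you):

```python
import torch


def log_cuda_memory(tag: str = '') -> None:
    # Allocated = memory held by live tensors; reserved = cached blocks
    # kept by PyTorch's allocator (this is what makes usage look like a leak).
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f'{tag} allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB')


# e.g. once every N training steps:
# log_cuda_memory('before empty_cache')
# torch.cuda.empty_cache()   # returns cached blocks to the driver
# log_cuda_memory('after empty_cache')
```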