Add a Jupyter file for trying the largest batch #23

Open
wants to merge 1 commit into main
Conversation

HLSS-Hen

Regarding #2, the VRAM leak issue:
This is not a memory leak, but rather the result of a significant disparity in the sizes of the input samples. Take the processed training data file sizes as an example: only 110 samples are larger than 500 KB, and fewer than 20,000 samples are larger than 300 KB (out of nearly 200,000 samples in total).

During training, batches are composed randomly (batch_size << number of samples). In extreme cases, the largest samples can end up in the same batch and easily trigger an OOM error. The seemingly leaking memory is actually caused by the occasional generation of larger batches: if you call torch.cuda.empty_cache(), you will see that the actual memory usage fluctuates.
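For example, a minimal plain-PyTorch sketch (independent of this repo) showing how the caching allocator keeps memory reserved after a large batch is freed, and how torch.cuda.empty_cache() reveals the real usage:

```python
import torch

def report(tag):
    # allocated = memory held by live tensors; reserved = memory kept by the caching allocator
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB")

x = torch.randn(4096, 4096, device="cuda")  # stand-in for an unusually large batch
report("after big allocation")

del x                       # the tensor is gone, but the allocator keeps the block cached
report("after del")

torch.cuda.empty_cache()    # return cached blocks to the driver; reserved memory drops
report("after empty_cache")
```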

I've written this Jupyter script to help you confirm whether your GPU can handle an extremely large batch. By manually adjusting BATCH_SIZE, you can determine an appropriate size for training. However, I cannot guarantee how much headroom is safe to reserve: someone reported an OOM error with a batch size of 2 on a 24 GB GPU, even though this script ran successfully.

@ChengkaiYang

I wonder whether this problem occurs during training of nearly all models on Argoverse 2. Does it indicate that training a model on Argoverse 2 requires a lot of computational power? Has anyone tried reducing the batch size and the learning rate? Can we still get the same experimental results as reported in the paper?

@ZikangZhou
Owner

Hi @HLSS-Hen,

Thanks for contributing to this repo! I'm sorry for not getting back to you sooner. Could you please add a section in the README.md to briefly illustrate the usage of this script so that people who come across the OOM issue can know how to use the script? Thanks!

@HLSS-Hen
Author

@ZikangZhou, I'm not good at writing, so feel free to revise it.

Treating the license cell as the 0-th Jupyter cell, users need to fill in the CUDA device to use (torch.cuda.set_device(DEVICE_ID)), the dataset root, and the batch size correctly in the first cell, and configure the model in the third cell to match the model actually used.
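A sketch of what that first cell might look like; the values below are placeholders to be adapted to your setup:

```python
# First cell (sketch): adjust these placeholders to your machine.
import torch

DEVICE_ID = 0                            # index of the CUDA device to test on
DATASET_ROOT = "/path/to/argoverse_v2"   # root directory of the (preprocessed) dataset
BATCH_SIZE = 32                          # candidate batch size to stress-test

torch.cuda.set_device(DEVICE_ID)
```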

Execute all cells in sequence; the code will automatically build a training-step input from the BATCH_SIZE largest samples (ordered by sample file size) and complete a full single forward and backward pass of the training step.
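A rough sketch of the idea behind those cells; ArgoverseV2Dataset, build_model, and the way the loss is reduced are stand-ins here, and the notebook itself follows the repo's actual classes and training-step loss:

```python
import os
import torch
from torch_geometric.data import Batch

dataset = ArgoverseV2Dataset(root=DATASET_ROOT, split="train")  # stand-in for the repo's dataset class

# Select the BATCH_SIZE largest samples by processed file size: the worst case for VRAM.
sizes = [(i, os.path.getsize(p)) for i, p in enumerate(dataset.processed_paths)]
largest = [i for i, _ in sorted(sizes, key=lambda t: t[1], reverse=True)[:BATCH_SIZE]]
batch = Batch.from_data_list([dataset[i] for i in largest]).to("cuda")

model = build_model().cuda()   # stand-in: construct the model exactly as train_qcnet.py does
model.train()

pred = model(batch)
# Any scalar derived from the outputs suffices for a memory stress test; the notebook
# computes the same loss as the actual training step.
tensors = pred.values() if isinstance(pred, dict) else [pred]
loss = sum(t.float().sum() for t in tensors if torch.is_tensor(t))
loss.backward()
```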

If a dataset download starts during data loading, it indicates that the given dataset root is incorrect; delete the newly downloaded files to keep the disk clean.

When executing the last cell, an OOM error means the configured batch size is too large; consider a GPU with more VRAM, gradient accumulation, a smaller batch size, etc.
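Of these, gradient accumulation is often the cheapest fix. A minimal self-contained plain-PyTorch sketch of the pattern (toy model and loss just to show the mechanics); with PyTorch Lightning, the Trainer's accumulate_grad_batches argument achieves the same effect:

```python
import torch

model = torch.nn.Linear(16, 1).cuda()                          # toy model just to show the pattern
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = [torch.randn(8, 16, device="cuda") for _ in range(8)]   # eight small micro-batches

ACCUM_STEPS = 4  # effective batch size = micro-batch size * ACCUM_STEPS

optimizer.zero_grad()
for step, batch in enumerate(data):
    loss = model(batch).pow(2).mean()    # stand-in loss; use the repo's training loss in practice
    (loss / ACCUM_STEPS).backward()      # scale so accumulated gradients match one large batch
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                 # update once every ACCUM_STEPS micro-batches
        optimizer.zero_grad()
```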

If all cells finish without errors, the current batch size can be considered usable. You can use the nvidia-smi command to check VRAM usage. Note, however, that your desktop environment, other programs running on the GPU, and some parallel training strategies all require a certain amount of VRAM.
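Besides nvidia-smi, the peak usage of the test itself can be read from PyTorch directly; a small sketch:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the forward/backward cell here ...
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**30:.2f} GiB")
```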

If you are not comfortable with Jupyter, you can copy the code of each cell into a new .py file and execute it directly. If you need to check VRAM usage afterwards, you can add input() at the end of the file to keep the process alive.

@SunHaoOne

@HLSS-Hen
Hi, I've come up with a great idea. I experimented with using PyTorch 2's features and found that invoking model = torch.compile(model) in the train_qcnet.py significantly reduces memory usage. This approach leverages the new capabilities introduced in PyTorch 2, optimizing the model's memory footprint efficiently.
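For reference, a sketch of the change (how train_qcnet.py builds the model and trainer is assumed here; torch.compile requires PyTorch >= 2.0):

```python
import torch

model = QCNet(**model_kwargs)                  # built exactly as train_qcnet.py already does (stand-in kwargs)
if hasattr(torch, "compile"):                  # torch.compile only exists in PyTorch >= 2.0
    model = torch.compile(model)               # kernel fusion can lower peak VRAM, at extra compile time
trainer.fit(model, datamodule=datamodule)      # the rest of the training script is unchanged
```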

@HLSS-Hen
Author

HLSS-Hen commented Dec 5, 2023

@SunHaoOne ,
Oh, yes, this is likely because torch.compile fuses some operators, reducing VRAM cost. In fact, I am not familiar with PyTorch Lightning, and I always thought PL would apply this function automatically.

For those unfamiliar with torch.compile, please read the PyTorch documentation on torch.compile, as this function increases training preparation (compilation) time and host memory (not VRAM) usage.
