
Probable GPU Memory Leak In Finetuning #30

Open
israwal opened this issue Jun 6, 2023 · 0 comments
israwal commented Jun 6, 2023

Hi,

Thanks for making the code available. I recently ran into an error while finetuning Singularity-Temporal on my own dataset. A trial run on a subset of the dataset finished fine, but on the full dataset the run crashed at around epoch 6 without an informative error message (the batch size was the same in both experiments):
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104856 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104857 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104858 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104859 closing signal SIGHUP
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{ "message": { "message": "SignalException: Process 3104850 got signal: 1",

This seems to be a GPU memory leak: memory usage apparently grows over epochs until the job is killed. One way to check is to log the peak GPU memory once per epoch and watch whether it climbs, as in the sketch below.
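A minimal sketch of such per-epoch logging, assuming a single-process PyTorch setup (`log_gpu_memory` is a hypothetical helper, not something in this repo):

```python
import torch

def log_gpu_memory(epoch: int) -> None:
    """Print the peak GPU memory allocated by tensors during this epoch."""
    peak_mib = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"epoch {epoch}: peak GPU memory {peak_mib:.0f} MiB")
    # Reset the peak counter so each epoch is measured independently.
    torch.cuda.reset_peak_memory_stats()
```

Calling this once per epoch: if the printed peak climbs steadily at a fixed batch size, some lingering reference is keeping old tensors alive.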
Adding `del question_input, image, answer_input` at the end of the training and evaluation loops in vqa.py resolved the issue for me; a sketch of the change follows.
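For context, a minimal sketch of where the `del` lands, assuming a conventional PyTorch training loop; the loop structure and model call signature here are illustrative, not the exact code in vqa.py:

```python
def train(model, data_loader, optimizer, device):
    model.train()
    for image, question_input, answer_input in data_loader:
        image = image.to(device, non_blocking=True)
        loss = model(image, question_input, answer_input, train=True)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # The fix: explicitly drop the last references to the batch tensors
        # so the caching allocator can reuse their memory right away.
        # (Same addition at the end of the evaluation loop.)
        del question_input, image, answer_input
```

Note that `del` only removes the Python names; the underlying GPU memory is freed once nothing else (e.g. a list accumulating un-detached tensors) still references them.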

PS: I haven't tried reproducing this on the datasets reported in the paper, only on my custom dataset. I'm posting the issue in case anyone else is in the same boat.

Thanks!
I.
