
Probable GPU Memory Leak In Finetuning #30

Open
israwal opened this issue Jun 6, 2023 · 0 comments
israwal commented Jun 6, 2023

Hi,

Thanks for making the code available. I recently ran into an error while finetuning Singularity-Temporal on my own dataset. A trial run on a subset of the dataset finished fine, but on the full dataset the run crashed at around epoch 6 without an informative error message (the batch size was the same in both experiments):
WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104856 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104857 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104858 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3104859 closing signal SIGHUP
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{ "message": { "message": "SignalException: Process 3104850 got signal: 1",

This seems to be a GPU memory leak: memory usage apparently grows over epochs until the job is killed. One way to check is to log the peak GPU memory once per epoch and watch whether it climbs, as in the sketch below.
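A minimal sketch of such per-epoch logging, assuming a single-process PyTorch setup (`log_gpu_memory` is a hypothetical helper, not something in this repo):

```python
import torch

def log_gpu_memory(epoch: int) -> None:
    """Print the peak GPU memory allocated by tensors during this epoch."""
    peak_mib = torch.cuda.max_memory_allocated() / 1024 ** 2
    print(f"epoch {epoch}: peak GPU memory {peak_mib:.0f} MiB")
    # Reset the peak counter so each epoch is measured independently.
    torch.cuda.reset_peak_memory_stats()
```

Calling this once per epoch: if the printed peak climbs steadily at a fixed batch size, some lingering reference is keeping old tensors alive.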
Adding `del question_input, image, answer_input` at the end of the training and evaluation loops in vqa.py resolved the issue for me; a sketch of the change follows.
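For context, a minimal sketch of where the `del` lands, assuming a conventional PyTorch training loop; the loop structure and model call signature here are illustrative, not the exact code in vqa.py:

```python
def train(model, data_loader, optimizer, device):
    model.train()
    for image, question_input, answer_input in data_loader:
        image = image.to(device, non_blocking=True)
        loss = model(image, question_input, answer_input, train=True)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # The fix: explicitly drop the last references to the batch tensors
        # so the caching allocator can reuse their memory right away.
        # (Same addition at the end of the evaluation loop.)
        del question_input, image, answer_input
```

Note that `del` only removes the Python names; the underlying GPU memory is freed once nothing else (e.g. a list accumulating un-detached tensors) still references them.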

PS: I haven't tried reproducing this on the datasets reported in the paper, only on my custom dataset. I'm posting the issue in case anyone else is in the same boat.

Thanks!
I.
