Dataloader killed: system memory utilisation is monotonically increasing during training #37

Open · gle-bellier opened this issue Sep 14, 2024 · 10 comments · Fixed by #38
Labels: bug (Something isn't working)

@gle-bellier (Collaborator):

I launched an 80-epoch training on the PASTIS dataset with num_workers=1 and batch_size=8 (which leaves a lot of GPU memory headroom). With PASTIS I am using ResizeToEncoder, which is independent of the Tiling data augmentation, so this is not linked to #21.

The error was raised during evaluation at epoch 70 (sorry for the screenshot, but copy-pasting the logs gets messy), but I don't think it is directly linked to validation.
[screenshot: error log]

Looking at the system logs, the amount of data written to disk (in MB) rockets up just before the crash:
[screenshot: disk write activity]

But more importantly, the system memory utilization is monotonically increasing during training:
[screenshot: system memory utilization over time]

I need to check with other datasets. This behavior seems to appear for all models.

@gle-bellier (Collaborator, author):

Same observation for the MADOS dataset with the default training command from the README. Fortunately, training ends before reaching maximum system memory utilization (green curves; the violet ones are the reference PASTIS run mentioned above).

```bash
torchrun --nnodes=1 --nproc_per_node=1 run.py \
  --config configs/run/default.yaml \
  --encoder_config configs/foundation_models/prithvi.yaml \
  --dataset_config configs/datasets/mados.yaml \
  --segmentor_config configs/segmentors/upernet.yaml \
  --augmentation_config configs/augmentations/segmentation_default.yaml \
  --num_workers 4 --eval_interval 1 --use_wandb
```

[screenshots: system memory utilization curves for the MADOS and PASTIS runs]

@gle-bellier (Collaborator, author):

I'm observing the same behavior independently of the dataset (previous comment) and the model (see screenshot below). I'm looking at the training code for bugs.
All the visible runs crashed after hitting max system memory utilisation.
[screenshot: system memory utilization across runs]

@KerekesDavid (Collaborator):

I think this is similar to what I ran into; some of my runs also got killed after 60-70 epochs. It's interesting that @SebastianGer's training logs didn't show the same thing. I'll try looking into this tomorrow as well.

> Looking at the system logs, the amount of data written to disk (in MB) rockets up just before the crash

Probably just virtual memory (swap) getting hit after you run out of regular RAM.

@KerekesDavid added the bug label on Sep 15, 2024.
@gle-bellier (Collaborator, author):

OK, so I looked at several things (including logging, losses, pin_memory, ...) and ended up flattening the system memory utilization curve by setting persistent_workers to False.
More investigation is needed if we want to keep using this option, but IMHO it's not the priority.
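
For reference, the change boils down to something like the sketch below (illustrative only, not the repo's actual loader-building code; DummyDataset is a hypothetical stand-in for the real dataset classes). Note that persistent_workers=False is also PyTorch's default, so this just reverts to tearing workers down after every epoch.

```python
# Minimal sketch of the workaround: do not keep DataLoader workers alive
# between epochs, so whatever state they accumulate is released each epoch.
# DummyDataset is a hypothetical stand-in for the real dataset classes.
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # A real dataset would load and augment an image here.
        return torch.zeros(3, 64, 64), torch.tensor(0)


loader = DataLoader(
    DummyDataset(),
    batch_size=8,
    num_workers=1,
    persistent_workers=False,  # workers are torn down after every epoch
)

for epoch in range(2):
    for images, targets in loader:
        pass  # training step goes here
```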

@gle-bellier linked a pull request on Sep 15, 2024 that will close this issue.
@gle-bellier added a commit referencing this issue on Sep 16, 2024 (…ry-utilisation-is-monotonically-increasing-during-training): "Remove persistent workers. Fix #37"
@KerekesDavid (Collaborator):

I'll keep this open for now. Setting persistent_workers=False is probably just a workaround for the workers keeping something in memory they shouldn't, and we shouldn't need to continually kill and respawn worker processes to get stable training.

Also, depending on the implementation, num_workers=0 will probably bring the issue back for future users.

I'll run some experiments today to confirm.

@KerekesDavid reopened this on Sep 16, 2024.
@gle-bellier (Collaborator, author):

I don't have time to look deeper into it, but it seems persistent workers sometimes misbehave with data shuffling.

@KerekesDavid (Collaborator):

I've looked around; it might be related to pytorch/pytorch#13246 :/
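
For anyone landing here later: that PyTorch issue is about copy-on-write growth, not a true leak. When the dataset object holds large collections of small Python objects (lists of paths, dicts of metadata, ...), every access in a forked worker bumps refcounts, dirties the shared pages, and each worker slowly ends up with its own copy, which looks exactly like a monotonic RAM climb. A hypothetical repro of that pattern (not our dataset code) would look like this:

```python
# Hypothetical illustration of the pytorch/pytorch#13246 pattern:
# a dataset whose index is a big list of Python objects. Forked workers
# touch these objects, refcount updates dirty the copy-on-write pages,
# and each worker's resident memory creeps up as training progresses.
import torch
from torch.utils.data import DataLoader, Dataset


class ListBackedDataset(Dataset):
    def __init__(self, n=1_000_000):
        # Millions of small Python objects shared with workers via fork.
        self.index = [(f"sample_{i}.tif", i % 10) for i in range(n)]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        path, label = self.index[idx]  # refcount write -> page gets copied
        return torch.zeros(3, 32, 32), torch.tensor(label)


# Iterating this across many epochs with forked (especially persistent)
# workers is where the slow memory growth shows up.
loader = DataLoader(ListBackedDataset(), batch_size=8, num_workers=4)
```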

@KerekesDavid (Collaborator) commented Sep 17, 2024:

[Screenshot from 2024-09-17 09-59-24]

This is with

```bash
torchrun --nproc_per_node=1 run.py --config configs/run/mados_prithvi.yaml --num_workers 0 --eval_interval 1 --epochs 80 --use_wandb
```

on master, so the problem still exists with num_workers=0.

@LeungTsang (Collaborator):

I can reproduce the increasing memory utilization, but in my case it is very moderate, so something related to the experiment environment is probably involved. With num_workers=0 the main process does the data loading, so it behaves the same as a persistent worker. I know a solution but it is not straightforward (https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/). I don't really think fully fixing it is our priority (or even our responsibility), and setting persistent_workers to False is OK.
[Screenshot from 2024-09-18 12-55-13]
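
For completeness, the mitigation described in that blog post amounts to storing the per-sample metadata in numpy buffers instead of Python objects, so workers read raw bytes and never touch refcounts. A rough sketch (hypothetical class name, not code from this repo):

```python
# Sketch of the "serialize the index into numpy arrays" approach from the
# linked blog post. ArrayBackedDataset is a hypothetical example, not the
# project's dataset code.
import pickle

import numpy as np
import torch
from torch.utils.data import Dataset


class ArrayBackedDataset(Dataset):
    def __init__(self, samples):
        # Pack each record into bytes and keep only two numpy arrays.
        # Neither array holds Python objects, so reading them in a forked
        # worker does not dirty the copy-on-write pages.
        blobs = [pickle.dumps(s, protocol=-1) for s in samples]
        self._addr = np.cumsum([len(b) for b in blobs])  # end offsets
        self._data = np.frombuffer(b"".join(blobs), dtype=np.uint8)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        path, label = pickle.loads(self._data[start:end].tobytes())
        # A real dataset would load the image at `path` here.
        return torch.zeros(3, 32, 32), torch.tensor(label)


ds = ArrayBackedDataset([(f"sample_{i}.tif", i % 10) for i in range(1000)])
```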

@KerekesDavid (Collaborator):

> I know a solution but it is not straightforward (https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/).

Yeah, that's a good write-up of the solutions mentioned in the thread about that issue.

Your graph looks a lot tamer than mine; in my case memory usage went up by ~600% over 80 epochs. What was your run config?
