Dataloader killed: system memory utilisation is monotonically increasing during training #37

Open · gle-bellier opened this issue Sep 14, 2024 · 10 comments · Fixed by #38
Labels: bug (Something isn't working)

@gle-bellier (Collaborator):

I launched an 80-epoch training on the PASTIS dataset with num_workers=1 and batch_size=8 (which leaves a lot of GPU memory headroom). With PASTIS I am using ResizeToEncoder, which is independent of the Tiling data augmentation, so this is not linked to #21.

The error was raised during evaluation at epoch 70 (sorry for the screenshot, but copy-pasting the logs gets messy), but I don't think it is directly linked to validation.
[screenshot: error log]

Looking at the system logs, the amount of data written to disk (in MB) rockets up just before the crash:
[screenshot: disk write activity]

But more importantly, the system memory utilization is monotonically increasing during training:
[screenshot: system memory utilization over time]

I need to check with other datasets. This behavior seems to appear for all models.

@gle-bellier (Collaborator, author):

Same observation for the MADOS dataset with the default training command from the README. Fortunately, training ends before reaching maximum system memory utilization (green curves; the violet ones are the reference PASTIS run mentioned above).

```bash
torchrun --nnodes=1 --nproc_per_node=1 run.py \
  --config configs/run/default.yaml \
  --encoder_config configs/foundation_models/prithvi.yaml \
  --dataset_config configs/datasets/mados.yaml \
  --segmentor_config configs/segmentors/upernet.yaml \
  --augmentation_config configs/augmentations/segmentation_default.yaml \
  --num_workers 4 --eval_interval 1 --use_wandb
```

[screenshots: system memory utilization curves for the MADOS and PASTIS runs]

@gle-bellier (Collaborator, author):

I'm observing the same behavior independently of the dataset (previous comment) and the model (see screenshot below). I'm looking at the training code for bugs.
All the visible runs crashed after hitting max system memory utilisation.
[screenshot: system memory utilization across runs]

@KerekesDavid (Collaborator):

I think this is similar to what I ran into; some of my runs also got killed after 60-70 epochs. It's interesting that @SebastianGer's training logs didn't show the same thing. I'll try looking into this tomorrow as well.

> Looking at the system logs, the amount of data written to disk (in MB) rockets up just before the crash

Probably just virtual memory (swap) getting hit after you run out of regular RAM.

@KerekesDavid added the bug label on Sep 15, 2024.
@gle-bellier (Collaborator, author):

OK, so I looked at several things (including logging, losses, pin_memory, ...) and ended up flattening the system memory utilization curve by setting persistent_workers to False.
More investigation is needed if we want to keep using this option, but IMHO it's not the priority.
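
For reference, the change boils down to something like the sketch below (illustrative only, not the repo's actual loader-building code; DummyDataset is a hypothetical stand-in for the real dataset classes). Note that persistent_workers=False is also PyTorch's default, so this just reverts to tearing workers down after every epoch.

```python
# Minimal sketch of the workaround: do not keep DataLoader workers alive
# between epochs, so whatever state they accumulate is released each epoch.
# DummyDataset is a hypothetical stand-in for the real dataset classes.
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        # A real dataset would load and augment an image here.
        return torch.zeros(3, 64, 64), torch.tensor(0)


loader = DataLoader(
    DummyDataset(),
    batch_size=8,
    num_workers=1,
    persistent_workers=False,  # workers are torn down after every epoch
)

for epoch in range(2):
    for images, targets in loader:
        pass  # training step goes here
```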

@gle-bellier linked a pull request on Sep 15, 2024 that will close this issue.
@gle-bellier added a commit referencing this issue on Sep 16, 2024 (…ry-utilisation-is-monotonically-increasing-during-training): "Remove persistent workers. Fix #37"
@KerekesDavid (Collaborator):

I'll keep this open for now. Setting persistent_workers=False is probably just a workaround for the workers keeping something in memory they shouldn't, and we shouldn't need to continually kill and respawn worker processes to get stable training.

Also, depending on the implementation, num_workers=0 will probably bring the issue back for future users.

I'll run some experiments today to confirm.

@KerekesDavid reopened this on Sep 16, 2024.
@gle-bellier (Collaborator, author):

I don't have time to look deeper into it, but it seems persistent workers sometimes misbehave with data shuffling.

@KerekesDavid (Collaborator):

I've looked around; it might be related to pytorch/pytorch#13246 :/
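
For anyone landing here later: that PyTorch issue is about copy-on-write growth, not a true leak. When the dataset object holds large collections of small Python objects (lists of paths, dicts of metadata, ...), every access in a forked worker bumps refcounts, dirties the shared pages, and each worker slowly ends up with its own copy, which looks exactly like a monotonic RAM climb. A hypothetical repro of that pattern (not our dataset code) would look like this:

```python
# Hypothetical illustration of the pytorch/pytorch#13246 pattern:
# a dataset whose index is a big list of Python objects. Forked workers
# touch these objects, refcount updates dirty the copy-on-write pages,
# and each worker's resident memory creeps up as training progresses.
import torch
from torch.utils.data import DataLoader, Dataset


class ListBackedDataset(Dataset):
    def __init__(self, n=1_000_000):
        # Millions of small Python objects shared with workers via fork.
        self.index = [(f"sample_{i}.tif", i % 10) for i in range(n)]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        path, label = self.index[idx]  # refcount write -> page gets copied
        return torch.zeros(3, 32, 32), torch.tensor(label)


# Iterating this across many epochs with forked (especially persistent)
# workers is where the slow memory growth shows up.
loader = DataLoader(ListBackedDataset(), batch_size=8, num_workers=4)
```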

@KerekesDavid (Collaborator) commented Sep 17, 2024:

[Screenshot from 2024-09-17 09-59-24]

This is with

```bash
torchrun --nproc_per_node=1 run.py --config configs/run/mados_prithvi.yaml --num_workers 0 --eval_interval 1 --epochs 80 --use_wandb
```

on master, so the problem still exists with num_workers=0.

@LeungTsang (Collaborator):

I can reproduce the increasing memory utilization, but in my case it is very moderate, so something related to the experiment environment is probably involved. With num_workers=0 the main process does the data loading, so it behaves the same as a persistent worker. I know a solution but it is not straightforward (https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/). I don't really think fully fixing it is our priority (or even our responsibility), and setting persistent_workers to False is OK.
[Screenshot from 2024-09-18 12-55-13]
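
For completeness, the mitigation described in that blog post amounts to storing the per-sample metadata in numpy buffers instead of Python objects, so workers read raw bytes and never touch refcounts. A rough sketch (hypothetical class name, not code from this repo):

```python
# Sketch of the "serialize the index into numpy arrays" approach from the
# linked blog post. ArrayBackedDataset is a hypothetical example, not the
# project's dataset code.
import pickle

import numpy as np
import torch
from torch.utils.data import Dataset


class ArrayBackedDataset(Dataset):
    def __init__(self, samples):
        # Pack each record into bytes and keep only two numpy arrays.
        # Neither array holds Python objects, so reading them in a forked
        # worker does not dirty the copy-on-write pages.
        blobs = [pickle.dumps(s, protocol=-1) for s in samples]
        self._addr = np.cumsum([len(b) for b in blobs])  # end offsets
        self._data = np.frombuffer(b"".join(blobs), dtype=np.uint8)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self._addr[idx - 1])
        end = int(self._addr[idx])
        path, label = pickle.loads(self._data[start:end].tobytes())
        # A real dataset would load the image at `path` here.
        return torch.zeros(3, 32, 32), torch.tensor(label)


ds = ArrayBackedDataset([(f"sample_{i}.tif", i % 10) for i in range(1000)])
```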

@KerekesDavid (Collaborator):

> I know a solution but it is not straightforward (https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/).

Yeah, that's a good write-up of the solutions mentioned in the thread about that issue.

Your graph looks a lot tamer than mine; in my case memory usage went up by ~600% over 80 epochs. What was your run config?
