Dataloader killed: system memory utilisation is monotonically increasing during training #37
I have the same observation for the MADOS dataset with the default training command from the README. Fortunately, training ends before system memory utilization reaches its maximum (the green curves; the violet ones are the reference PASTIS run mentioned above).
I think this is similar to what I ran into; some of my runs also got killed after 60-70 epochs. It's interesting that @SebastianGer's training logs didn't show the same thing. I'll try looking into this tomorrow as well.
Probably just the VRAM getting hit after you run out of regular RAM.
Ok, so I looked at several things (including logging, losses, pin_memory, ...) and ended up flattening the system memory utilization by setting persistent_workers=False.
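For reference, a minimal sketch of what that change looks like on a stock torch.utils.data.DataLoader (the TensorDataset below is only a hypothetical stand-in for the repo's actual dataset class, and the batch size and worker count are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset, just to make the snippet self-contained.
train_dataset = TensorDataset(torch.randn(256, 3, 64, 64), torch.zeros(256))

train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=1,
    pin_memory=True,
    # With persistent_workers=False the worker processes are torn down and
    # re-created every epoch, so whatever memory they accumulate (caches,
    # copied pages, etc.) is released at each epoch boundary.
    persistent_workers=False,
)
```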
…ry-utilisation-is-monotonically-increasing-during-training: Remove persistent workers. Fix #37
I'll keep this open for now. Setting persistent_workers=False seems to fix it. Also, depending on the implementation, worker_threads=0 will probably bring the issue back for future users. I'll run some experiments today to confirm.
I don't have time to look deeper into it, but it seems persistent workers sometimes misbehave with data shuffling.
I've looked around; it might be related to pytorch/pytorch#13246 :/
I can reproduce the increasing memory utilization, but in my case it is very moderate, so something about the experiment environment must be involved. With num_workers=0 the main process does the data loading itself and behaves the same as a persistent worker. I know of a solution, but it is not straightforward (https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/). I don't think completely fixing this is our priority (or even our responsibility), and setting persistent_workers to False is OK.
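For future readers, the workaround described in that post boils down to something like the sketch below (not code from this repo; NumpySerializedDataset and its internals are illustrative names): pickle every sample into one flat numpy byte buffer, so that forked workers read shared, read-only pages instead of touching Python objects, whose reference-count updates trigger copy-on-write and make each worker's RSS creep up.

```python
import pickle
import numpy as np
from torch.utils.data import Dataset

class NumpySerializedDataset(Dataset):
    """Stores a list of samples as one numpy byte buffer plus an offset
    array, instead of keeping a Python list of objects around."""

    def __init__(self, samples):
        serialized = [np.frombuffer(pickle.dumps(s), dtype=np.uint8) for s in samples]
        # End offset of each sample inside the flat buffer.
        self.offsets = np.cumsum([len(s) for s in serialized])
        self.buffer = np.concatenate(serialized)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else int(self.offsets[idx - 1])
        end = int(self.offsets[idx])
        # Only numpy buffers are touched here, so the worker never updates
        # refcounts on the original Python sample objects.
        return pickle.loads(self.buffer[start:end].tobytes())
```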
Yeah, that's a good writeup of the solutions mentioned in the thread about that issue. Your graph looks a lot tamer than mine; memory usage in my case went up by ~600% over 80 epochs. What was your run config?
I launched a training of 80 epochs on the PASTIS dataset with num_workers=1 and batch_size=8 (which means I have a lot of GPU memory headroom). With the PASTIS dataset I'm using ResizeToEncoder, which is independent of the Tiling data augmentation, so this is not linked to #21.

An error is raised during evaluation at epoch 70 (sorry for the screenshot, but it gets messy when copy-pasting logs), but I think it's not directly linked to validation.
Looking at the system logs, the number of MB written to disk shoots up just before the crash. But more importantly, the system memory utilization increases monotonically throughout training.
I need to check with other datasets. This behavior seems to appear for all models.
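To make the trend easier to compare across datasets and models, a psutil-based helper like the one below (a hypothetical addition, not part of the repo) could be called once per epoch to log the memory of the training process and its DataLoader workers alongside overall system utilisation:

```python
import os
import psutil  # assumed to be installed; not a dependency of the repo

def log_host_memory(epoch: int) -> None:
    """Log the combined RSS of the main process and its worker children,
    plus the overall system memory utilisation."""
    main = psutil.Process(os.getpid())
    procs = [main] + main.children(recursive=True)  # DataLoader workers are child processes
    rss_gib = sum(p.memory_info().rss for p in procs) / 1024 ** 3
    sys_pct = psutil.virtual_memory().percent
    print(f"epoch {epoch:3d}: train RSS {rss_gib:.2f} GiB, system memory {sys_pct:.1f} %")
```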