Preprocessor memory usage is excessive and error messages unclear. #51
Thank you for opening this issue. Can you please attach the precise recipe you are using and, if possible, specify how much memory it requires or how much it uses when it crashes?
Please don't mix things up: Rob was complaining about a memory issue that is completely inexplicable, given that the recipe he ran (with the exact same settings) was run by both Ranjini and me and no memory issues were found. It's a trivial recipe with trivial computations (no multimodel or other memory-heavy operations). I am starting to look at this specific recipe.
This is easily diagnosable if you run in debug mode.
Change the dataset line in the recipe, i.e. change the time range.
Cool, cheers, man! Will have a play now.
Can you please advise where to get the WOA obs4mips files on JASMIN? Also, are you using the obs4mips data in the multimodel step? If not, having just a single CMIP5 dataset means no multimodel computations will be done.
I can copy over the WOA files now. Where shall I put them?
Is this OBS or obs4mips data?
OBS, I would guess. What's the difference? Happy to provide the script, but it's not pretty.
It's obs4mips, but it's not playing any role in this issue. Actually, I found the memory leak... hold on, I'll post the result.
OK guys, found the memory thief: I used Lee's recipe, which I trimmed heavily but kept the 1950-2004 data range, and that's the bugger. The run doesn't even get into the meat of the preprocessor; it spends about 2.5 minutes inside fix_data.
Also FYI @jvegasbsc.
Also, this memory spill does not scale with the number of variables (same behaviour for 5, 2 and a single variable). That is fairly obvious, since the data fix on a single-variable cube is already consuming loads of memory, but it's worth reporting.
Fortunately, by getting onto jasmin-sci3 (which has 2 TB of RAM!) I managed to run this to completion:
So the two big culprits are fix_data at 17 GB and average_volume at 19 GB.
It is important to define a data-to-process ratio R, i.e. the peak memory a module uses divided by the amount of input data it works on. The data comes in 6 files = 5 x 1.5 GB + 1 x 1 GB = 8.5 GB of netCDF data, so fix_data has R = 2 and average_volume has R = 2.3. This means, at the very least, that both these modules hold two copies of the data, which is very inefficient!
And this R is constant, i.e. the memory intake scales with the amount of data: I ran the same test with half the data (1977-2004) and the ratios stayed the same.
Note that the memory carried over between the two modules is 9.7 GB and 5.0 GB respectively, which is slightly more than the equivalent total amount of data loaded from disk (the extra being fx files, Python modules, etc.), so this is normal.
I also ran with a simple two-model setup so I could test the multimodel module (no average_volume this time, just multimodel: mean in my preprocessor): the data amounts to about 5 + 7 = 12 GB, and sure enough fix_data used 9 GB and then 14 GB of RAM, while multimodel peaked at 25 GB, which gives R ~ 2.1. So yeah, multimodel is not great, but it could be much worse 😁 The problem with multimodel is when we need to churn through 20 models, each worth 100 years.
We can actually establish a very rough, order-of-magnitude picture of the total memory ESMValTool needs relative to the input netCDF data; take fix_data as an example. For a preprocessor module with R ~ 2 that is not multimodel (fix_data and average_volume are about the same, give or take), and for a typical R = 2, F = 1 GB and N = 4 (F being the average file size and N the number of files), you'd expect the module to take a few GB of RAM, which is not too bad; for multimodel it's a bit better than expected. Note that the only quantity we can optimize in these estimates is R, the module's computational efficiency: even in the ideal case R = 0 we still have a maximum memory set by the number of files N and the average file size F, maxRAM_ideal = (N - 1) x F for a single-file module (with a similar expression for a multimodel module).
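As a minimal, illustrative sketch of this kind of estimate (not ESMValTool code; the function name is invented, the ideal floor (N - 1) x F comes from the comment above, and the expected peak R x N x F follows directly from how R is defined):

```python
def estimate_preproc_ram_gb(n_files, avg_file_gb, r=2.0):
    """Very rough, order-of-magnitude memory estimate for one preprocessor step.

    n_files     -- N, the number of input netCDF files/datasets
    avg_file_gb -- F, the average file size in GB
    r           -- R, the module's memory-to-data ratio (~2 measured above)
    """
    input_gb = n_files * avg_file_gb            # total data read from disk
    ideal_floor = (n_files - 1) * avg_file_gb   # maxRAM_ideal = (N - 1) x F
    expected_peak = r * input_gb                # follows from R = peak / input
    return input_gb, ideal_floor, expected_peak


# The fix_data case above: N = 6 files totalling ~8.5 GB, measured peak ~17 GB.
print(estimate_preproc_ram_gb(6, 8.5 / 6, r=2.0))   # -> (8.5, ~7.1, 17.0)
```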
We can generalize these estimates to cases where the preprocessing module receives data that has already been chopped (level extraction, area/volume subsetting) and to data that is not fully realized. For cases where we deal with a lot of datasets (R + N ~ N) and the data is fully realized, assuming an average size of 1.5 GB for 10 years of 3D netCDF data, N such datasets will require on the order of N x 1.5 GB of RAM.
So, interestingly enough, the multimodel module will actually require less maximum memory than a non-multimodel module (not by much, given that N is large, but still) 😃
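To put the thread's rough numbers together: 1.5 GB per 10 years of 3D data means about 15 GB per 100-year dataset, so the 20-model, 100-year case mentioned earlier amounts to roughly 300 GB of input netCDF data; with R ~ 2 a naive, fully realized step would then need on the order of 600 GB of peak RAM, well beyond a typical compute node.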
@ledm you may want to try ESMValGroup/ESMValTool#816 to improve your multimodel life once it gets merged.
Having measured R ~ 2-2.3 for the modules above, we can use R = 3 to set an upper limit on the maximum memory any given preprocessor should take, so we can detect memory leaks. A memory leak doesn't necessarily mean the memory intake goes through the roof; a lot of bad coding may lead to memory intakes 30-60% higher than what's actually needed, and we can detect that fairly robustly.
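A minimal sketch of what such a check could look like (illustrative only, not existing ESMValTool code; only the R = 3 bound comes from the comment above):

```python
def looks_like_memory_leak(peak_ram_gb, input_data_gb, r_max=3.0):
    """Flag a preprocessor step whose peak memory exceeds r_max x its input data.

    r_max = 3 is the rough upper bound proposed above for a well-behaved step;
    peak_ram_gb would come from whatever memory profiling the run already does.
    """
    return peak_ram_gb > r_max * input_data_gb


# fix_data above: 17 GB peak on 8.5 GB of input -> R = 2, within the bound.
print(looks_like_memory_leak(17.0, 8.5))   # False
```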
Can we close this? @ledm
Added the memory equations to the documentation in #867; apart from that we can't really do much more about this issue.
Please keep it open, because it's not fixed. What we can and should do, once we've switched to iris 2, is go over all preprocessor functions carefully and try to change them so they work with dask arrays instead of numpy arrays. This should solve most memory usage issues. |
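For illustration, a minimal sketch of the difference, assuming Iris >= 2; the file name is made up and the real preprocessor functions are of course more involved:

```python
import iris

# Hypothetical CMIP-style file name; any netCDF cube will do.
cube = iris.load_cube("thetao_Omon_SOMEMODEL_historical_r1i1p1.nc")
print(cube.has_lazy_data())                # True: nothing read into RAM yet

lazy_mean = cube.core_data().mean(axis=0)  # dask array, still a lazy graph

# Touching cube.data (or passing it to numpy) realizes the whole array in
# memory, which is where the large per-dataset memory cost comes from.
# realized = cube.data
```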
Can do. I was thinking about creating a new issue for that; it would be slightly cleaner, i.e. you don't have to scroll all the way down past all sorts of theoretical gibberish.
This looks like a duplicate of #32, can we close?
I think not; there are more ways to reduce memory usage than using iris and dask. Some things will have to be implemented in iris too, e.g. lazy regridding and lazy vertical level interpolation. For multimodel statistics we could consider saving data to disk more often to reduce memory use, though that would trade one performance problem for another.
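As a sketch of the dask direction for multimodel statistics (illustrative only, assuming the cubes are already on a common grid and time axis; this is not the actual ESMValTool implementation):

```python
import dask.array as da
import iris

model_files = ["model1_thetao.nc", "model2_thetao.nc"]   # hypothetical paths
cubes = [iris.load_cube(f) for f in model_files]

# Stack the models' lazy dask arrays and average across them; the result is
# evaluated chunk by chunk, so all N datasets never have to sit in RAM at once.
stacked = da.stack([c.core_data() for c in cubes], axis=0)
multimodel_mean = stacked.mean(axis=0)     # still lazy until computed or saved
```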
It looks like someone has optimised the memory usage of the preprocessor in the last few months: my job with ~1500 input files used to run out of memory and crash when I ran it in one go in the spring, and now it works! Thanks!
It looks like most of the issues discussed here (lazy regridding, lazy vertical level interpolation, lazy multimodel statistics, an overview of memory usage, etc.) have been fixed in previous releases. Remaining problems can be discussed in #674 or in dedicated issues.
This issue was raised in PR #763 and was also raised by Rob Parker from the UKESM project.
The recipe recipe_ocean_bgc.yml explodes in memory when the time range is changed to a more realistic one. The time range is currently 2001-2004, but we may be interested in, for instance, the range 1950-2010. The second problem is that the warning and error messages related to memory problems in ESMValTool are unclear: it's not obvious which part of this recipe causes the memory problems.