outlier detection get_sections should set buffer size based on total available memory #8708
Issue JP-3706 was created on JIRA by Ned Molter:

To save memory while computing the median over a large number of input models in outlier_detection, models are loaded in spatial sections using outlier_detection.utils.get_sections_library(). The size of these sections is governed by the buffer_size parameter of outlier_detection.utils.create_median_library(), whose default value, expressed in megabytes, controls how large a section is loaded per model. This is a clunky way to express the limit, and it leads to many more model read-write operations (and therefore much longer runtimes than necessary) when the total memory allocation, i.e. buffer size times number of models, is much smaller than the memory available on the machine.

buffer_size should instead be set based on the desired maximum memory usage, either manually or automatically.
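A minimal sketch of what the proposed behavior could look like, assuming psutil is available to query system memory; derive_buffer_size is a hypothetical helper for illustration, not the jwst API:

```python
import psutil


def derive_buffer_size(n_models, allowed_memory=0.5, reserved_bytes=0):
    """Hypothetical helper: derive a per-model buffer size (in bytes)
    from a fraction of the memory currently available on the machine.

    reserved_bytes accounts for allocations made regardless of section
    size, such as the pre-allocated output median image.
    """
    available = psutil.virtual_memory().available
    budget = available * allowed_memory - reserved_bytes
    # Split the budget evenly across the models, but never go below
    # 1 MB per model, since very small sections make runtime explode.
    return max(int(budget // n_models), 1 << 20)
```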
Comment by Ned Molter on JIRA: I wrote a draft implementation of this, starting from the ModelLibrary PR JP-3690. It works as follows: the total memory budget is the machine's available memory times the allowed_memory parameter, and that budget is divided evenly among the models to set the per-model buffer size, which in turn determines how many rows are read per section.

I'm testing this on my local machine with a ~40-member association in which each datamodel is roughly 100 MB. To isolate memory usage in create_median itself, I temporarily hard-coded the allowed_memory parameter in the resample part of outlier detection to None, i.e., use all available memory, and then ran the outlier detection step while varying allowed_memory.
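As an illustration of how a per-model buffer translates into row sections, here is a small sketch (the function name and details are mine, not jwst's):

```python
import math

import numpy as np


def compute_sections(shape, dtype, buffer_bytes):
    """Illustrative: split a 2-D image into row sections that fit a buffer.

    Returns (rows_per_section, n_sections) for an array of the given
    shape and dtype, reading at least one row per section.
    """
    nrows, ncols = shape
    row_bytes = ncols * np.dtype(dtype).itemsize
    rows_per_section = max(1, buffer_bytes // row_bytes)
    n_sections = math.ceil(nrows / rows_per_section)
    return rows_per_section, n_sections
```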
- Test 1: allowed_memory = 2e-4. Chunk size 1.4 MB per model; data are read in 347 sections of 27 rows each. This approximately mimics the old behavior, where chunk_size had been hard-coded to 1 MB. Execution time for create_median: 662 s.
- Test 2: allowed_memory = 0.03. Chunk size 180 MB per model; data are read in 3 sections of 3784 rows each. Execution time for create_median: 99 s.
The runtime balloons when the chunks are small and require many load operations, and the proposed fix appears to alleviate that as far as possible. I also checked both runs with a memory profiler to verify that the total memory usage during create_median equals the machine's total available memory times the allowed_memory parameter. This sanity check passes, so I think this is working as intended. Note that the median image array is pre-allocated regardless of the chunk size, so when the chunks are very small, the pre-allocation is what sets the memory usage during create_median.
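For context, a minimal numpy sketch of the section-wise median pattern under discussion; the output array is pre-allocated up front, and the get_rows method is an illustrative stand-in for however a model reads a row section from disk (this is not the jwst implementation):

```python
import numpy as np


def sectioned_median(models, shape, rows_per_section):
    """Illustrative: median over many images, one row section at a time.

    `models` is any sequence of objects whose .get_rows(start, stop)
    returns a 2-D float array. The full median image is pre-allocated,
    which sets a floor on memory use no matter how small the sections are.
    """
    median = np.empty(shape, dtype=np.float32)  # pre-allocated output
    nrows = shape[0]
    for start in range(0, nrows, rows_per_section):
        stop = min(start + rows_per_section, nrows)
        # Stack the same row section from every model, then reduce.
        stack = np.stack([m.get_rows(start, stop) for m in models])
        median[start:stop] = np.nanmedian(stack, axis=0)
    return median
```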
Comment by Brett Graham on JIRA: Would you point me to the association, data files, and ideally the script used to run the tests? I'd like to try running this to see the performance (and try some different optimizations). Also, what is the execution time with in_memory=True?
Comment by Ned Molter on JIRA: Attaching the asn and the script. The data are a subset of the same dataset we used to test ModelLibrary for JP-3690. I'll check on in_memory=True.
Comment by Brett Graham on JIRA: Thanks for sharing the script. I just ran it using the code on jwst main and create_median took 175 seconds. Is there a chance the tests you ran used a network drive? I attached the profile and a screenshot of the rendered profile (using snakeviz):
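For anyone reproducing this, a profile like that can be generated with the standard library's cProfile module and rendered with snakeviz; the script name below is a placeholder for the test script attached to this ticket:

```python
import cProfile

# Run the attached test script under the profiler and dump the stats;
# "run_outlier_step.py" is a placeholder name, not a file in jwst.
cProfile.run("exec(open('run_outlier_step.py').read())", "create_median.prof")

# The dump can then be rendered in a browser with:
#   snakeviz create_median.prof
```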
Comment by Ned Molter on JIRA: Thanks for looking at the profiling. I should have said that I put the timing wrapper around create_median. I don't know why it took your machine a bit longer than mine, but you are now most likely deeper into this than I was at the time, and I'm curious to see what you come up with.

If I understand this correctly, though, it's true that the primary time cost comes from file I/O.
Comment by Brett Graham on JIRA: The profile was generated with jwst main (not your PR), so the 175 seconds I got was much less than the 662. Any idea why it's so much slower on your machine?
Comment by Ned Molter on JIRA: Oh, I see. Was that with a 1 MB section size? I will try to figure out the reason. Are you seeing a more gradual change with section size than I did in this very quick-and-dirty test?
Comment by Brett Graham on JIRA: The profile was with the main branch, using the default hard-coded 10 MB buffer_size (jwst/jwst/outlier_detection/utils.py, line 35 at commit 4dfe982).

I'll keep poking as time allows.
Comment by Ned Molter on JIRA: Longer update coming soon, but the difference is almost certainly that the 662 seconds was obtained with a buffer size of 1 MB per model.
Comment by Ned Molter on JIRA: OK, I'm adding some more runtime profiling plots here. The upshot is that the 10-MB-per-model buffer works pretty well for this association, but that is seemingly by accident and might change for a different group of models. The runtime always gets shorter the more memory you allocate, but the relationship is very non-linear: below a certain per-model buffer size the runtime starts to explode (a timing sketch follows at the end of this comment).

I should note that the 10 MB default was not calculated or otherwise tested to be anywhere near "optimal" by any metric. I know this because I changed the default from 1 MB to 10 MB by accident in the ModelLibrary PR (here's the git blame from beforehand). When I say it was an accident, I mean that I changed the value locally because I was sick of waiting around for tests to run, and forgot to revert it.
The snakeviz screenshots attached here should be relatively clear from their filenames. The first was created with…
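For reference, a sketch of the kind of timing sweep behind plots like these; run_step is a placeholder for whatever invokes create_median with a given per-model buffer size, not a jwst function:

```python
import time

MB = 1 << 20


def time_create_median(run_step, buffer_sizes):
    """Time a create_median-like callable across per-model buffer sizes.

    `run_step` is a placeholder for whatever runs the step with the
    given buffer size; this only sketches the sweep itself.
    """
    for buffer_bytes in buffer_sizes:
        start = time.perf_counter()
        run_step(buffer_size=int(buffer_bytes))
        elapsed = time.perf_counter() - start
        print(f"buffer {buffer_bytes / MB:5.1f} MB/model: {elapsed:7.1f} s")


# Example sweep: the observed runtime curve is non-linear, blowing up
# for sub-megabyte buffers while flattening out for large ones.
# time_create_median(run_outlier_step, [0.1 * MB, 1 * MB, 10 * MB, 100 * MB])
```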
Comment by Ned Molter on JIRA: This ticket is superseded by JP-3741 and JP-3744. If the file I/O is made more efficient, then the runtime of the step should be insensitive to the specific choice of chunk size, and the parameter will not need to be exposed. With regard to the default memory-handling behavior, this ticket has led to the proposal that…