outlier detection get_sections should set buffer size based on total available memory #8708

Open · stscijgbot-jp opened this issue Aug 15, 2024 · 11 comments

@stscijgbot-jp (Collaborator)

Issue JP-3706 was created on JIRA by Ned Molter:

To save memory while computing the median over a large number of input models in outlier_detection, models are loaded in spatial sections using outlier_detection.utils.get_sections_library().  The size of these sections is governed by the buffer_size parameter of outlier_detection.utils.create_median_library(), whose default value, expressed in megabytes, controls how large a section is loaded per model.  Expressing the limit as a fixed per-model buffer is clunky, and it leads to many more model read-write operations (and therefore much longer runtimes than necessary) when the total memory allocation, i.e., buffer size times number of models, is much smaller than the memory available on the machine.

The buffer_size should be set based on the desired maximum memory usage, either manually or automatically.
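
For context, here is a minimal sketch of the section-by-section median pattern described above, assuming the data are read in row blocks; the open_section callback and all names are illustrative placeholders, not the actual get_sections_library()/create_median_library() API:

    import numpy as np

    def sectioned_median(open_section, n_models, nrows, ncols, rows_per_section):
        # Pre-allocate the full-size median image up front.
        median = np.empty((nrows, ncols), dtype=np.float32)
        for row0 in range(0, nrows, rows_per_section):
            row1 = min(row0 + rows_per_section, nrows)
            # The same row block is read from every model, so each section costs
            # n_models file reads; tiny sections therefore mean many such passes.
            stack = np.stack([open_section(i, row0, row1) for i in range(n_models)])
            median[row0:row1] = np.nanmedian(stack, axis=0)
        return median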

stscijgbot-jp changed the title from "outlier detection get_sections should guess its own buffer size" to "outlier detection get_sections should set buffer size based on total available memory" on Aug 15, 2024
@stscijgbot-jp (Collaborator Author) commented Aug 26, 2024

Comment by Ned Molter on JIRA:

I wrote a draft implementation of this, starting from the ModelLibrary PR JP-3690, that works as follows:

  • The pre-existing allowed_memory parameter, defined as the fraction of available memory that the user wants to allocate to the pipeline, is now used to govern the chunk size in create_median when on_disk is True (previously it was only used to check whether there was enough available memory for resampling).
  • The default allowed_memory for the step is set to 0.5, i.e., use 50% of the available memory on the machine.
  • The total available memory on the machine is computed as virtual memory plus swap memory, mirroring the behavior in resample.
  • The buffer_size parameter that already existed in create_median is computed as
    available_memory / len(resampled_models.group_names) / _ONE_MB; it expects a number expressed in megabytes (a sketch of this calculation follows below).
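
For illustration, a minimal sketch of how such a per-model buffer size could be derived, assuming psutil supplies the memory numbers; the function name compute_buffer_size and the exact choice of psutil fields are my assumptions, not necessarily what the draft PR does:

    import psutil

    _ONE_MB = 1 << 20  # bytes per megabyte

    def compute_buffer_size(resampled_models, allowed_memory=0.5):
        # Memory budget for the step: a fraction of (virtual + swap) memory,
        # mirroring the check done in resample.  Which psutil fields the
        # draft implementation actually uses is an assumption here.
        available = psutil.virtual_memory().available + psutil.swap_memory().total
        allowed_bytes = available * allowed_memory
        # Split the budget evenly across the input groups; create_median
        # expects the per-model buffer size in megabytes.
        return allowed_bytes / len(resampled_models.group_names) / _ONE_MB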

 

I'm testing this on my local machine with a ~40-member association where each datamodel is roughly 100 MB.  To better understand memory usage in create_median itself, I've temporarily hard-coded the allowed_memory parameter in the resample part of outlier detection to be None, i.e., use all available memory.  Then I'm running the outlier detection step while changing allowed_memory.

 

Test 1: allowed_memory = 2e-4. Chunk size 1.4 MB per model; the data are read in 347 sections of 27 rows each.  This approximately mimics the old behavior, where chunk_size had been hardcoded to 1 MB.  The execution time for create_median was 662 s.

Test 2: allowed_memory = 0.03. Chunk size 180 MB per model; the data are read in 3 sections of 3784 rows each.  The execution time for create_median was 99 s.
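
For intuition about those section counts: the number of sections follows from how many whole image rows fit into the per-model buffer. A rough sketch of that arithmetic, with an image shape and dtype assumed purely for illustration (not taken from the test data):

    import math

    # Hypothetical resampled image: ~13,000 float32 columns, ~11,000 rows.
    nrows, ncols, itemsize = 11_000, 13_000, 4
    row_bytes = ncols * itemsize

    def sections_for(buffer_size_mb):
        rows_per_section = max(1, int(buffer_size_mb * 2**20 // row_bytes))
        return rows_per_section, math.ceil(nrows / rows_per_section)

    print(sections_for(1.4))    # small buffer -> tens of rows per section, hundreds of reads
    print(sections_for(180.0))  # large buffer -> only a handful of sections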

 

It can be seen that the runtime balloons when the chunks are small and require many load operations, and the proposed fix appears to alleviate that to the extent possible.

I also checked the results of a memory profiler for both runs to ensure that the total memory usage during create_median equals the machine's total available memory times the allowed_memory parameter.  This sanity check passes, so I think this is working as intended.  Note that the median image array is pre-allocated regardless of the chunk size, and when the chunks are very small, that pre-allocated array is what sets the memory usage during create_median.
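
As a rough check of that last point, the peak memory in this mode can be approximated as the pre-allocated median image plus one section buffer per model; the shape, dtype, and model count below are assumed purely for illustration:

    # Back-of-the-envelope peak-memory estimate (illustrative numbers only).
    n_models = 40
    median_nbytes = 11_000 * 13_000 * 4  # pre-allocated float32 median image, ~0.5 GB
    for buffer_size_mb in (1.4, 180.0):
        buffers_nbytes = n_models * buffer_size_mb * 2**20
        peak_gb = (median_nbytes + buffers_nbytes) / 2**30
        # With tiny buffers the median array dominates; with large ones the sections do.
        print(f"{buffer_size_mb:6.1f} MB/model -> ~{peak_gb:.1f} GB peak")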

@stscijgbot-jp (Collaborator Author)

Comment by Brett Graham on JIRA:

Would you point me to the association, data files and ideally the script used to run the tests? I'd like to try running this to see the performance (and try some different optimizations).

Also, what is the execution time for in_memory=True?

@stscijgbot-jp (Collaborator Author)

Comment by Ned Molter on JIRA:

Attaching the asn and the script.  The data are a subset of the same dataset we were using to test ModelLibrary for JP-3690.  I'll check on in_memory=True.

@stscijgbot-jp (Collaborator Author)

Comment by Brett Graham on JIRA:

Thanks for sharing the script. I just ran it using the code on jwst main and create_median took 175 seconds. Is there a chance the tests you ran used a network drive? I attached the profile and a screenshot of the rendered profile (using snakeviz):
[Attachment: Screenshot 2024-09-10 at 1.53.59 PM.png (rendered snakeviz profile)]

@stscijgbot-jp (Collaborator Author)

Comment by Ned Molter on JIRA:

Thanks for looking at the profiling.  I should have said that I put the timing wrapper around create_median.  I don't know why it took your machine a bit longer than mine.  But you are now most likely deeper into this than I was at the time.  I'm curious to see what you come up with.

 

If I understand this correctly, though, it's true that the bulk of the time is spent on file I/O.

@stscijgbot-jp (Collaborator Author)

Comment by Brett Graham on JIRA:

The profile was generated with jwst main (not your PR) so the 175 seconds I got was much less than the 662. Any idea why it's so much slower on your machine?

@stscijgbot-jp (Collaborator Author)

Comment by Ned Molter on JIRA:

Oh, I see.  That was with a 1 MB section size?  I will try to figure out the reason.

Are you seeing a more gradual change with the section size than I did with this very quick-and-dirty test?

@braingram (Collaborator)

The profile was with the main branch using the default 10 MB hard-coded "buffer_size":

def create_median(resampled_models, maskpt, on_disk=True, buffer_size=10.0):

I'll keep poking as time allows.

@stscijgbot-jp (Collaborator Author)

Comment by Ned Molter on JIRA:

Longer update soon, but the difference is almost certainly that the 662 seconds was measured with a buffer size of 1 MB per model.

@stscijgbot-jp (Collaborator Author)

Comment by Ned Molter on JIRA:

Ok, I'm adding some more runtime profiling plots here.  The upshot is that the 10-MB-per-model buffer works pretty well for this association, but that is seemingly by accident and might change for a different group of models.  The runtime is always shorter the more memory you allocate, but the relationship is very non-linear: if you go to a smaller per-model buffer, the runtime starts to explode.

 

I should note here that the 10 MB default was not calculated or otherwise tested to be close to "optimal" by any metric.  I know this because I changed the default by accident from 1 MB to 10 MB in the ModelLibrary PR. (Here's the git blame from beforehand.  When I say it was an accident, I mean I changed the value locally because I was sick of waiting for tests to run, and forgot to revert it.)

 

The screenshots of snakeviz outputs here should be relatively clear from their filenames.  The first was created with in_memory=True; the other three are labeled with the number of megabytes corresponding to the per-model buffer size.  All of these were created with my PR branch by modifying the allowed_memory parameter.  I disabled the other usage of allowed_memory, which checks the size of the drizzled output, in order to make the smallest-buffer-size run possible.  I also removed the swap memory from the available-memory calculation, which is why these look a bit different from the original tests.  The number of MB per model is computed and printed to the logs, and that is what I'm reporting here: ~100 MB, ~7 MB, and ~700 KB.

 

 

@stscijgbot-jp (Collaborator Author)

Comment by Ned Molter on JIRA:

This ticket is superseded by JP-3741 and JP-3744.  If the file I/O is made more efficient, the runtime of the step should be insensitive to the specific choice of chunk size, and the parameter will not need to be exposed.  With regard to default memory handling, this ticket has led to the proposal that in_memory=True become the default behavior.
