-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JP-3706: Set outlier detection buffer size based on total available memory #8756
Conversation
…assign_mtwcs to library" This reverts commit 018e3ce.
… duplicated utilities
Since allowed memory defaults to None does that mean that outlier detection will now use up to half the available memory for the median (where before it was 10 MB)? It looks like the "memory" is including swap so it will likely start swapping before using a smaller buffer. It's also hard to think how this new use and the existing use of "allowed memory" (for resample called by outlier detection) will interact. Are there options to:
|
Yes, it will use half the available memory in total (or whatever we set the default to). But note that before, it was using 10 MB per model, so the total memory was not well defined.
We could do this of course. But I don't know what you mean about interactions between them. The only thing the
Yes, this is a good idea. Do you think it would be better to set the default to some fraction of the total without swapping, or just decrease the number? And what do you think is reasonable? TBH I have no sense for this. |
One scenario is a user sets The docs also mention the use of the environment variable I'm not sure picking a one-size-fits-all buffer size will ever work. Zarr defaults to 1MB chunks: https://zarr.readthedocs.io/en/v2.1.0/tutorial.html#chunk-size-and-shape but our uses here are encumbered by the datamodels requiring loading substantially more than the chunk. |
Okay, sure, but in the scenario you outlined I feel as though the real problem is that when the user set the allowed memory in the first instance, it was exceeded without tripping our custom
That is not currently supported by this PR and I agree that whether this should be supported depends on if we use an additional/separate parameter for the allowed memory in the median computation.
I feel as though having a patchwork of multiple different parameters governing different aspects of the memory usage is confusing for users. If I were running the pipeline, I would want to be able to specify a total memory allocation not to be exceeded. This might be more challenging to implement, but it's not obvious to me why it would never work. |
I am in favor of exposing control of the "running-median" buffer size to the user and I think a different option is easier to understand (and document) than trying to re-use As far as the default I think 1 or 10 MB per model is closer to what we want (ideally) instead of 50% of remaining RAM and swap (which will almost certainly swap). This is likely inefficient for the current code (which uses datamodels for the temporary storage). This is separate from this PR but I think we could see (possibly bigger) efficiency gains if we instead used "vanilla" numpy arrays for the temporary storage. For that configuration I don't know for certain but I have a hunch (based on the zarr default) that a smaller buffer size would be better (or perhaps picking one that most closely matches a page size). What do you think about changing this PR to:
|
I think we need some additional opinions here. I don't think my questions/concerns from my previous comment are addressed by your reply, so let me try to add some info and context so we can hopefully resolve this. Overall, I'm of the fairly strong opinion that regardless of whether we end up with one parameter or two, there is no reason why the default buffer size should be any absolute number of megabytes per model. The two numbers that are relevant here are the available memory on the machine and the number of models, and the program can figure out both. It seems to me that the fastest processing would occur with the minimum number of load operations, which would happen when all available memory is being used (with the caveat that we should avoid using swap as you pointed out). I don't have any problem with separating this parameter from I understand your concern that combining the parameters might lead to confusion, but let me give you an example where separating them leads to confusion: say you're processing 100 images with substantial spatial overlap. Say each image is 50 MB. The output resampled model might not be much larger than each individual integration, call it 250 MB. The |
Thanks for the extra details. I didn't mean to dismiss (or miss) any questions or concerns.
One reason is that the file IO will occur as blocks. Assuming all disk IO is using a 4k block size and the stars align so that one section from one model is 4k, the disk IO for computing the median using 4k from each model would be the same as reading in all models (assuming the OS doesn't over-read the file to try to fill a memory page). If we pick a section that is too small (<4k) than each section will result in reading in a block (4k) and then throwing out the extra memory. Picking a bigger section is less of an issue for IO (but the ideal would still be block aligned, 8k, 12k etc).
Thanks for the details here. Extending this description, assuming infinite RAM the best option would be to set So for the median calculation if we consider the:
If "section" is too small (smaller than a filesystem block) IO will likely limit performance. Are we in agreement that we want to avoid those 2 conditions? If so, I think the question is where do we target in between those two extremes?
I agree that |
I totally agree. And I guess it would be good to issue a warning if a user parameter causes the code to run up against either extreme. As far as where to target, I assumed that one would want to use as much memory as possible (without using swap) so that the number of sections per model is minimized (and therefore the models need to be loaded as few times as possible. Do you agree with that philosophy? |
I'd so no :) One example is our regression tests. They're currently run in parallel. If outlier detection decides to use all the RAM any other parallel test will start swapping when it tries to allocate memory. Another example is HPC users who might have jobs tuned for processing jwst data. Does psutil take into consideration memory limits for job queueing systems like slurm? If a user has a job with a defined (often required) memory limit if we make the pipeline use a buffer based on the free RAM (instead of the memory limit) we're likely to pick a value higher than the limit and get the job killed. |
These are fair points. If I hear what you're saying, there is no "good" default that works in every case, and I think I agree. But isn't that sort of the point of exposing the parameter? The two use-cases you brought up are somewhat "custom" in the sense we wouldn't probably expect most users to be doing that. In the case that someone is running jobs in parallel, they would be most likely to scale the memory available to each node according to the number of nodes. So I'd assume it would still be desirable to set the total allowed memory by outlier detection, and the parameter should therefore parametrize that total. Similarly for the queuing system. In neither case does it make sense to me to parametrize a per-model memory allocation, unless you were somehow computing the median itself in parallel over the chunks (which we do not support). I could see a case for no default at all, i.e., if |
I'm all for exposing some parameter. I don't think either of these are uncommon use cases (the regtests run every night and of the few help desk issues I've seen and the github issues from external users many of them were running on HPCs).
Currently there is no parameter exposed. We agree that exposing something could be useful. I think the options at this point in the conversation are:
These are similar to the 2 extremes we'd like to avoid mentioned above:
If we pick only one, we accept that the other won't be configurable and there are circumstances where that will be problematic. What if we picked both with "sensible" defaults?
As far as I know, no default settings prior to this PR set a limit on the maximum memory used by the pipeline. Is that your understanding as well?
I don't think that will work as I think my main concern for this PR is the change in the default behavior. Where on main the outlier detection median computation is largely IO bound (and memory efficient) with an option to disable this by setting |
Maybe I'm missing something, but in the proposal above, isn't buffer size simply section size * number of models? Don't they depend on one another? What is the point of setting section_size to anything smaller than buffer size / number of models?
The code currently checks if the size of the output array for resample is larger than the available memory (virtual+swap)*
But the code runs much faster when we push it toward the memory-hungry side. See my comment on the JIRA ticket. And FWIW I think the default should be changed to |
Yeah they depend on one another. On main section_size is hard-coded and buffer size is computed. With this PR buffer size has a default value (50% ram and swap), is user configurable and section_size is computed. We have to pick (at least) one and give it a default. I think your question assumes we're exposing only buffer size (and computing section size). If we go with this approach, is there an alternative to using some % of free ram and swap for the default (which as discussed above has issues)? If not, I think that argues that picking a default section size makes more sense as we can default it to a small typical block size.
That's only used if jwst/jwst/resample/resample.py Lines 155 to 158 in 0e86465
I'll take a look at the ticket. Did you try using numpy arrays and not datamodels as temporary files? |
No, I just got confused about your comment that we should expose both. I cannot picture a scenario where both would be tuned separately.
No, that is beyond the scope of this PR.
I think we still differ in opinion on how serious those issues are. For the regression tests, I would argue that
Repeating my findings from the ticket, there was a factor-of-7 speedup when changing the section size from ~1 MB to ~100 MB. I'm very hesitant to set a default behavior that throws away that speedup. |
log.info("Computing median in chunks to save memory") | ||
if allowed_memory is None: | ||
allowed_memory = 0.5 | ||
machine_available_memory = psutil.virtual_memory().available + psutil.swap_memory().total |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why swap memory is included. Unless I misunderstand what this refers to (and I may very well misunderstand), I don't think the size of the swap area should be included. We are most interested in how much real memory is available so no paging or swapping is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing this. Brett pointed out the same thing, and it's buried somewhere in our long back-and-forth (understandable that you did not read it all lol). If the swap memory were removed, what is your take on whether it's better to use the total amount of available memory or a per-model section size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I just noticed the long discussion "below the fold". I had some general concerns as well along Brett's lines.
I'm not suggesting that we do this but I'd like to provide a point of comparison with the approach taken in this PR. It's my interpretation that increasing the memory footprint for outlier detection is primarily to work around the poor IO performance for the median calculation. I believe this is primarily because datamodels are used as the temporary storage. Here's a branch that uses zarr for the median calculation: Running the example association discussed in the linked JP ticket with this branch takes 76.4 seconds to compute the median when using a 1 MB chunk size (and 76.8 for a 10 MB) chunk size. Yes this type of change is outside the scope of this PR but I believe it addresses the root cause and is much less sensitive to this parameter (the JP ticket mentions runtimes of ~197 seconds for a 7 MB section size and 159 for a 100 MB section size). It is also closer to the "in memory" performance of 33 seconds as mentioned in the JP ticket. |
I agree completely.
This makes a lot of sense, because loading and saving datamodels has lots of overhead.
I agree this gets around the problem. If we could decrease the sensitivity of the runtime to the chunk size, then I would have absolutely no issues with your proposed solution of leaving the buffer size a per-model quantity. In fact, in that case I might be in favor of not exposing any parameter at all, and leaving everything else as it is on main. The whole reason I wrote the ticket was because of strong sensitivity of runtime on that parameter, but if that went away, I don't see a good reason we couldn't get away with, as you suggested in the first place, relatively small chunk sizes. Then we could, for example, add a warning when In conclusion I think that's a great idea and it doesn't seem particularly difficult, besides that we would need to do a small amount of wheel-reinventing to rebuild what zarr currently does. If we are going to do that eventually, I'd say we should do it right away instead of putting a band-aid solution in there - the 10 MB buffer that's the current behavior hasn't broken anything I'm aware of, so even if that doesn't make it into the next build I don't think it's the end of the world. What do you think of that solution @perrygreenfield and @nden ? |
I'm going to add a few general comments on the above discussion, which I have not read in complete detail but I think I get most of the gist of it. Like Brett, I am wary of making a fraction of available memory a default since we don't know the context that the code will be run under (e.g., parallel processses, or other memory intensive tasks that will be run after this starts. I'm also wary of focusing on a per model memory allocation, since that scales with the number of models, and it is the large model case that is causing problems. I think I would prefer a size of the all the sections as the allocation limit. Since we now have computers with GB's of memory, allocating any limit that results in 1 MB per model or anything in that range seems way too small as a default. With the current I/O that will only end up with massive rereads of FITS data extensions (correct?). Anything to minimize that would be a good thing. Figuring out how much memory a process should use is fraught with complications unless one is in control of all the processes that may be running and knows what the other processes need, which is not usually the case with most users (and may not be the case even in well controlled operations). But having the option to specify it that way probably is a useful thing, but shouldn't be the default in my opinion. |
Thanks for the response and suggestions Perry. When you say
does that mean you would be in favor of making |
It's been a while. Remind me what |
If we don't use FITS we can make it not a problem :) The zarr prototype computes the chunk size based one (n_models, n_rows, n_cols) where n_rows and n_cols is either 1 or a larger number keeping the total chunk size < 1 (or 10 MB). This shape was chosen so that the median calculation can occur for each chunk (each chunk is read only once) but this does make writing less efficient since each model will result in writing all chunks (there are likely ways to improve this since some of data will be nan so we don't need to write those chunks). Let me try to explain the current idea in my head. I doubt it's complete and help in pointing out issues and improvements is appreciated. For the current
The most efficient
This would allow reading and writing temporary files only once (although writing would be incremental through the appending of each section) and the total memory used would be largely based on the median image size. I think a sensible default for the temporary file size would be the same as the input data size but maybe there's a different optimal. I hope that makes sense but please let me know if there are any questions. I'm not 100% certain the appending sections to the temporary files will work but I think it should be doable (I just haven't done it yet so there might be an array ordering issue). |
When |
Regarding |
Regarding @braingram comment about the False case, that is essentially what I had in mind, at least as a temporary solution around the problems of reopening the existing format(s). |
This PR is being withdrawn in light of the fact that the memory handling should be changed so as to make the runtime of the step insensitive to the default chunk size; see #8774 |
Resolves JP-3706
Closes #8708
To save memory while computing the median over a large number of input models in outlier_detection, models are loaded in spatial sections using
outlier_detection.utils.get_sections()
. The size of these sections used to be governed by thebuffer_size
parameter, which was hard-coded to a default value expressed in megabytes that controlled the size of section that is loaded per-model. This was a clunky way of expressing this, and led to many more model read-write operations (and therefore much longer runtimes than necessary) when the total memory allocation, i.e., buffer size times number of models, was much smaller than the available memory on the machine.This PR re-purposes the outlier detection step's pre-existing
allowed_memory
parameter, which specifies the amount of allowed memory usage as a fraction of the total available memory, to specify the size of the sections loaded into memory. Thebuffer_size
parameter inget_sections
is now computed asavailable_memory / len(resampled_models.group_names) / _ONE_MB
(it expects a number in megabytes).Checklist for PR authors (skip items if you don't have permissions or they are not applicable)
CHANGES.rst
within the relevant release sectionHow to run regression tests on a PR