-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JP-3741: Faster temporary file I/O for outlier detection in on-disk mode #8782
Conversation
initial set of regression tests running here |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8782 +/- ##
=======================================
Coverage 61.86% 61.86%
=======================================
Files 377 377
Lines 38911 38952 +41
=======================================
+ Hits 24071 24097 +26
- Misses 14840 14855 +15 ☔ View full report in Codecov by Sentry. |
Most recent changes fixed the bug (at least, locally on my machine) uncovered by the regression tests, but I'll wait for some reviews before spamming Jenkins with another regression test run |
jwst/outlier_detection/spec.py
Outdated
buffer_size = None | ||
else: | ||
# reasonable buffer size is the size of an input model, since that is | ||
# the smallest theoretical memory footprint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering where this limit came from.
I think since the drizzled models are in memory (for in_memory=True) we will have to have at least 1 drizzled model in memory. We'll also always have all the input models in memory. So during generation of the 1st drizzled models we will have some baseline memory usage + all input models + 1 drizzled model. If we split up the drizzled model and write it out to many files (what this PR does, right? I haven't finished looking at it...) we can throw out the drizzled model before generating the next one (not what this PR does but what we might work towards). So our peak memory usage will be: baseline + n input models + 1 drizzled models. After drizzling we can go back down to baseline + n input models. We will have to then create a median image (equal to 1 drizzled array, but not model since we don't need weights etc) and then load each temporary file and fill in the median image. I think that gives a peak memory usage of: baseline + n input models + median image. I think this is less than baseline + n input models + 1 drizzled models by at least the weight array (the same size as the drizzled data). That's a very long winded way of saying, can't this be the size of the drizzled array and not the input array? If so, I think it would simplify the API as the buffer size could be computed within create_median (as it's written now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you that we can use the drizzled array size, so I will change that.
I'm almost certainly missing something, but why do all the input models need to be in memory? Do you mean within resample.do_drizzle
or somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. My Friday brain was conflating this with the in memory case. Ignore the bit about the n input models. The statements about the memory peak still hold (I think, I could be missing something and haven't tested this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default buffer size computation should be scaled by the size of the resampled images now, let me know if it looks ok to you
jwst/outlier_detection/utils.py
Outdated
filestem : str | ||
The stem of the temporary file name. The extension ".bits" will be added. | ||
""" | ||
self._temp_dir = tempfile.TemporaryDirectory(dir=tempdir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this writes the tempdir to the current working directory by default - is that desired? And might it lead to more collisions in ops? The output directory may not be the same as the current directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should lead to collisions because the temporary directory should be unique, even though yes by default it's a subdirectory within the cwd. I have no idea what default would be desired in ops, but I believe (@braingram can tell me if this is not correct) ModelLibrary
makes its temporary directory in the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah ModelLibrary uses the same pattern for temporary files:
https://github.com/spacetelescope/stpipe/blob/ecd5d162be425c24db2498b34bcbaeccec4ac557/src/stpipe/library.py#L181
Although currently those aren't used in ops (as far as I'm aware the libraries should always be on_disk=False except for the one in resample which points to the already generated models in the output directory). Finding (and configuring) a suitable tmpdir is likely worth exploring if we don't disable all tempfiles for ops.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that seems like it might be a little problematic. Why not just leave the default at None, so that the temporary directory is the system-dependent default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the default to None for the DiskAppendableArray in this PR makes sense to me, will do. Any change to tempdirs for ModelLibrary is beyond the scope of this PR, but let's discuss more if this is something you'd like to see done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there are issues with using the default /tmp
on systems in ops. @jhunkeler may know more details but I believe that /tmp
is not writeable for "security" reasons. What about using the same directory that is currently used on main for the temporary models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can ask team Coffee about this at our 3:00 meeting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@melanieclarke we talked about this with Hien, Eric, and Jesse. It sounds like there's not really any "good" place to do file I/O on their end. We confirmed though that right now all temporary files are written to the current working directory, including those made by the old (but currently operational) ModelContainer
when in_memory=False
as is the current outlier detection default. Based on that conversation I think leaving this as the default is likely the safest thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@braingram Sorry I missed this yesterday. The issue is that /tmp
is mounted with the noexec
flag:
cat << EOF > /tmp/myscript.sh
#!/usr/bin/env bash
echo "hello world"
EOF
chmod +x /tmp/myscript.sh
/tmp/myscript.sh
# result
-bash: /tmp/myscript.sh: Permission denied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks, @emolter!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think everything looks good now. Thanks very much to @emolter and @braingram for hashing this out together - I agree that this is a significant improvement to the step!
@emolter - can we re-run the full regression test set before we merge? @zacharyburnett - are you happy with the updates to the change log? |
started regression tests here |
I haven't looked at the details of the PR but are any updates to the outlier_detection docs needed? |
Actually, yes, there's one bullet in |
Looks good, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating the docs, @emolter!
Going to go ahead and merge this, assuming the small change I made is good enough to at least remove incorrect information. Additional docs changes to outlier detection can be deferred to #7817 |
Resolves JP-3741
Resolves JP-3762
Resolves JP-3764
Closes #8774
Closes #8829
Closes #8834
This PR has expanded to become a relatively major refactor of the outlier detection step. It achieves the following:
in_memory=False
by decreasing the data volume that needs to be read from disk during the median calculation. Prior to this PR, theget_sections
iterator was loading the entirety of each resampled datamodel every time it extracted one spatial segment. This led to lots of unnecessary file I/O: if you had, say, 30 resampled images of 100 MB for which you wanted to compute the median in 50 sections,get_sections
would read 30x50x100 MB of data in order to perform an operation that required just 30x100 MB as input. This PR makes it so each model is only read once (instead ofn_sections
times) by storing each section in its own appendable on-disk data array. For the example above in the new code, 50 on-disk arrays will be instantiated at the start of the operation, and thenwrite_sections
will load each resampled datamodel only once, appending (writing) one spatial section of that datamodel to each of the on-disk arrays until the whole datamodel is stored as part of 50 time stacks. Each of the 50 on-disk arrays has final shape(n_images, section_nrows, section_ncols)
. Each of these is then read one-by-one and the median is computed and stored for that section. The requisite 30x100 MB of temporary storage is saved only once and loaded only once. The on-disk arrays are deleted at the end of processing. A runtime comparison can be seen on the JIRA ticket: the amount of time spent on file I/O decreased by a factor of 100 for my test dataset.drizzled_model
(one per group) could be written into theOnDiskMedian
as soon as it was computed. This is better than saving it in a "median-unfriendly" way and then immediately loading it just to re-save it into a more "median-friendly" file structure. This refactor also made it unnecessary to store all the drizzled models in memory at once whenin_memory=True
- now only the data extension needs to be kept for median computation.resample_many_to_many
function unchanged. It was also discussed among Nadia, Mihai, Brett, Mairan, and I whether it's necessary to keepresample_many_to_many
around given that this PR bypasses it in the outlier detection step. Its removal would make it soResampleStep
andResampleSpecStep
could not be run in "single" mode, so the change in behavior might extend beyond outlier detection usages. We think the answer is that it can probably be removed, but it would be outside the scope of this PR.np.nanmedian
. A memory usage comparison can be seen on the JIRA ticket: peak memory usage within-memory=True
went from 34 GB on main to 15.4 GB on this PR branch for one test association (see the ticket here), and from 21 GB to 8.5 GB for a different one (see the write-up here)._median
,_blot
, and_outlier_?2d
models would be saved to the same filename for all slits in a MultiSlitModel, overwriting each other. This PR adds the slit name to the filename. Separate ticket was made to ensure this will be specifically tested by INS.Tasks
Build 11.3
(use the latest build if not sure)CHANGES.rst
within the relevant release section (otherwise add theno-changelog-entry-needed
label to this PR)docs/
pageokify_regtests
to update the truth files