Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JP-3741: Faster temporary file I/O for outlier detection in on-disk mode #8782

Merged
merged 51 commits into from
Sep 27, 2024

Conversation

emolter
Copy link
Collaborator

@emolter emolter commented Sep 12, 2024

Resolves JP-3741
Resolves JP-3762
Resolves JP-3764

Closes #8774
Closes #8829
Closes #8834

This PR has expanded to become a relatively major refactor of the outlier detection step. It achieves the following:

  1. Makes the outlier detection step run faster when in_memory=False by decreasing the data volume that needs to be read from disk during the median calculation. Prior to this PR, the get_sections iterator was loading the entirety of each resampled datamodel every time it extracted one spatial segment. This led to lots of unnecessary file I/O: if you had, say, 30 resampled images of 100 MB for which you wanted to compute the median in 50 sections, get_sections would read 30x50x100 MB of data in order to perform an operation that required just 30x100 MB as input. This PR makes it so each model is only read once (instead of n_sections times) by storing each section in its own appendable on-disk data array. For the example above in the new code, 50 on-disk arrays will be instantiated at the start of the operation, and then write_sections will load each resampled datamodel only once, appending (writing) one spatial section of that datamodel to each of the on-disk arrays until the whole datamodel is stored as part of 50 time stacks. Each of the 50 on-disk arrays has final shape (n_images, section_nrows, section_ncols). Each of these is then read one-by-one and the median is computed and stored for that section. The requisite 30x100 MB of temporary storage is saved only once and loaded only once. The on-disk arrays are deleted at the end of processing. A runtime comparison can be seen on the JIRA ticket: the amount of time spent on file I/O decreased by a factor of 100 for my test dataset.
  2. Refactors the resampling in outlier detection such that each drizzled_model (one per group) could be written into the OnDiskMedian as soon as it was computed. This is better than saving it in a "median-unfriendly" way and then immediately loading it just to re-save it into a more "median-friendly" file structure. This refactor also made it unnecessary to store all the drizzled models in memory at once when in_memory=True - now only the data extension needs to be kept for median computation.
  3. Leaves the behavior of the resample_many_to_many function unchanged. It was also discussed among Nadia, Mihai, Brett, Mairan, and I whether it's necessary to keep resample_many_to_many around given that this PR bypasses it in the outlier detection step. Its removal would make it so ResampleStep and ResampleSpecStep could not be run in "single" mode, so the change in behavior might extend beyond outlier detection usages. We think the answer is that it can probably be removed, but it would be outside the scope of this PR.
  4. Adds some memory-saving options to np.nanmedian. A memory usage comparison can be seen on the JIRA ticket: peak memory usage with in-memory=True went from 34 GB on main to 15.4 GB on this PR branch for one test association (see the ticket here), and from 21 GB to 8.5 GB for a different one (see the write-up here).
  5. Fixes intermediate file names in spectroscopic modes. Prior to this PR, the _median, _blot, and _outlier_?2d models would be saved to the same filename for all slits in a MultiSlitModel, overwriting each other. This PR adds the slit name to the filename. Separate ticket was made to ensure this will be specifically tested by INS.

Tasks

  • request a review from someone specific, to avoid making the maintainers review every PR
  • add a build milestone, i.e. Build 11.3 (use the latest build if not sure)
  • Does this PR change user-facing code / API?
    • add an entry to CHANGES.rst within the relevant release section (otherwise add the no-changelog-entry-needed label to this PR)
    • update or add relevant tests
    • update relevant docstrings and / or docs/ page
    • start a regression test and include a link to the running job (click here for instructions)
      • Do truth files need to be updated ("okified")?
        • after the reviewer has approved these changes, run okify_regtests to update the truth files
  • if a JIRA ticket exists, make sure it is resolved properly

@emolter
Copy link
Collaborator Author

emolter commented Sep 12, 2024

initial set of regression tests running here

Copy link

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 95.52846% with 11 lines in your changes missing coverage. Please review.

Project coverage is 61.86%. Comparing base (e860360) to head (2d5281c).
Report is 459 commits behind head on main.

Files with missing lines Patch % Lines
jwst/outlier_detection/utils.py 97.20% 5 Missing ⚠️
jwst/resample/resample.py 90.62% 3 Missing ⚠️
jwst/outlier_detection/coron.py 0.00% 1 Missing ⚠️
jwst/outlier_detection/spec.py 80.00% 1 Missing ⚠️
jwst/outlier_detection/tso.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8782   +/-   ##
=======================================
  Coverage   61.86%   61.86%           
=======================================
  Files         377      377           
  Lines       38911    38952   +41     
=======================================
+ Hits        24071    24097   +26     
- Misses      14840    14855   +15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@emolter emolter changed the title WIP: JP-3741: Faster temporary file I/O for outlier detection in on-disk mode JP-3741: Faster temporary file I/O for outlier detection in on-disk mode Sep 13, 2024
@emolter emolter marked this pull request as ready for review September 13, 2024 15:05
@emolter emolter requested a review from a team as a code owner September 13, 2024 15:05
@emolter
Copy link
Collaborator Author

emolter commented Sep 13, 2024

Most recent changes fixed the bug (at least, locally on my machine) uncovered by the regression tests, but I'll wait for some reviews before spamming Jenkins with another regression test run

buffer_size = None
else:
# reasonable buffer size is the size of an input model, since that is
# the smallest theoretical memory footprint
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering where this limit came from.

I think since the drizzled models are in memory (for in_memory=True) we will have to have at least 1 drizzled model in memory. We'll also always have all the input models in memory. So during generation of the 1st drizzled models we will have some baseline memory usage + all input models + 1 drizzled model. If we split up the drizzled model and write it out to many files (what this PR does, right? I haven't finished looking at it...) we can throw out the drizzled model before generating the next one (not what this PR does but what we might work towards). So our peak memory usage will be: baseline + n input models + 1 drizzled models. After drizzling we can go back down to baseline + n input models. We will have to then create a median image (equal to 1 drizzled array, but not model since we don't need weights etc) and then load each temporary file and fill in the median image. I think that gives a peak memory usage of: baseline + n input models + median image. I think this is less than baseline + n input models + 1 drizzled models by at least the weight array (the same size as the drizzled data). That's a very long winded way of saying, can't this be the size of the drizzled array and not the input array? If so, I think it would simplify the API as the buffer size could be computed within create_median (as it's written now).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you that we can use the drizzled array size, so I will change that.

I'm almost certainly missing something, but why do all the input models need to be in memory? Do you mean within resample.do_drizzle or somewhere else?

Copy link
Collaborator

@braingram braingram Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. My Friday brain was conflating this with the in memory case. Ignore the bit about the n input models. The statements about the memory peak still hold (I think, I could be missing something and haven't tested this).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default buffer size computation should be scaled by the size of the resampled images now, let me know if it looks ok to you

filestem : str
The stem of the temporary file name. The extension ".bits" will be added.
"""
self._temp_dir = tempfile.TemporaryDirectory(dir=tempdir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this writes the tempdir to the current working directory by default - is that desired? And might it lead to more collisions in ops? The output directory may not be the same as the current directory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should lead to collisions because the temporary directory should be unique, even though yes by default it's a subdirectory within the cwd. I have no idea what default would be desired in ops, but I believe (@braingram can tell me if this is not correct) ModelLibrary makes its temporary directory in the same way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ModelLibrary uses the same pattern for temporary files:
https://github.com/spacetelescope/stpipe/blob/ecd5d162be425c24db2498b34bcbaeccec4ac557/src/stpipe/library.py#L181
Although currently those aren't used in ops (as far as I'm aware the libraries should always be on_disk=False except for the one in resample which points to the already generated models in the output directory). Finding (and configuring) a suitable tmpdir is likely worth exploring if we don't disable all tempfiles for ops.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that seems like it might be a little problematic. Why not just leave the default at None, so that the temporary directory is the system-dependent default?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the default to None for the DiskAppendableArray in this PR makes sense to me, will do. Any change to tempdirs for ModelLibrary is beyond the scope of this PR, but let's discuss more if this is something you'd like to see done

Copy link
Collaborator

@braingram braingram Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there are issues with using the default /tmp on systems in ops. @jhunkeler may know more details but I believe that /tmp is not writeable for "security" reasons. What about using the same directory that is currently used on main for the temporary models?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can ask team Coffee about this at our 3:00 meeting

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@melanieclarke we talked about this with Hien, Eric, and Jesse. It sounds like there's not really any "good" place to do file I/O on their end. We confirmed though that right now all temporary files are written to the current working directory, including those made by the old (but currently operational) ModelContainer when in_memory=False as is the current outlier detection default. Based on that conversation I think leaving this as the default is likely the safest thing to do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@braingram Sorry I missed this yesterday. The issue is that /tmp is mounted with the noexec flag:

cat << EOF > /tmp/myscript.sh
#!/usr/bin/env bash
echo "hello world"
EOF
chmod +x /tmp/myscript.sh
/tmp/myscript.sh
# result
-bash: /tmp/myscript.sh: Permission denied

Copy link
Collaborator

@mairanteodoro mairanteodoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks, @emolter!

Copy link
Collaborator

@melanieclarke melanieclarke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think everything looks good now. Thanks very much to @emolter and @braingram for hashing this out together - I agree that this is a significant improvement to the step!

@melanieclarke
Copy link
Collaborator

@emolter - can we re-run the full regression test set before we merge?

@zacharyburnett - are you happy with the updates to the change log?

@emolter
Copy link
Collaborator Author

emolter commented Sep 26, 2024

started regression tests here

@nden
Copy link
Collaborator

nden commented Sep 27, 2024

I haven't looked at the details of the PR but are any updates to the outlier_detection docs needed?

@emolter
Copy link
Collaborator Author

emolter commented Sep 27, 2024

I haven't looked at the details of the PR but are any updates to the outlier_detection docs needed?

Actually, yes, there's one bullet in outlier_detection_imaging that is wrong now about the section size. I will update that now

@melanieclarke
Copy link
Collaborator

started regression tests here

Looks good, thanks!

@emolter
Copy link
Collaborator Author

emolter commented Sep 27, 2024

@nden how does 2d5281c look? Not sure whether so many implementation details should be in the user-facing docs, but at least the mention of a 1 MB buffer size has been removed.

Copy link
Collaborator

@mairanteodoro mairanteodoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the docs, @emolter!

@emolter
Copy link
Collaborator Author

emolter commented Sep 27, 2024

I haven't looked at the details of the PR but are any updates to the outlier_detection docs needed?

Going to go ahead and merge this, assuming the small change I made is good enough to at least remove incorrect information. Additional docs changes to outlier detection can be deferred to #7817

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants