Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify ModelContainer #8738

Closed
stscijgbot-jp opened this issue Aug 29, 2024 · 5 comments · Fixed by #8831 or spacetelescope/stdatamodels#330
Closed

Simplify ModelContainer #8738

stscijgbot-jp opened this issue Aug 29, 2024 · 5 comments · Fixed by #8831 or spacetelescope/stdatamodels#330

Comments

@stscijgbot-jp
Copy link
Collaborator

stscijgbot-jp commented Aug 29, 2024

Issue JP-3721 was created on JIRA by Ned Molter:

With the addition of ModelLibrary, the role of ModelContainer needs to be re-evaluated.  ModelLibrary certainly renders the return_open and save_open options obsolete, as well as the get_sections method, since if memory usage is a concern then ModelLibrary should be used instead.

However, during discussions related to JP-3715, it has become increasingly clear that the strictness of ModelLibrary and the additional borrow/shelve code required to access models is not necessary/desired for many use-cases.  For example, the calwebb_spec3 pipeline currently makes extensive use of ModelContainer but does not have the same memory issues as calwebb_image3 does.  Learning to use ModelLibrary may also be a nuisance for users manipulating relatively small datasets.

Therefore, the need may remain for a lightweight, easy-to-use class for loading a list of models and association metadata.  This container should satisfy the following constraints:

  • Loads association metadata dictionary
  • Is a Sequence, i.e., can be manipulated as if it were a list
  • Loads all datamodels into memory (i.e., use ModelLibrary instead if we care about memory usage)
  • Is the default data structure loaded by datamodels.open() for asn-type data
  • Can be used as a context manager, i.e., with datamodels.open(asn.json) as container has the expected behavior

The plan for now is to draft this container by substantially slimming down ModelContainer, and see if it satisfies the needs of the JWST pipeline.

 

More context about ModelContainer issues can be found on this innerspace page: https://innerspace.stsci.edu/display/SCSB/ModelContainer+uses+in+jwst 

@stscijgbot-jp
Copy link
Collaborator Author

stscijgbot-jp commented Sep 5, 2024

Comment by Ned Molter on JIRA:

A draft of the slimmed-down new-look ModelContainer can be found here.  -I was able to make this work with the rest of the jwst repository and run all the pipelines successfully with only minor modifications.-  I'm listing the major changes, along with some questions about what is the desired behavior.  I'm curious to hear anyone's feedback.

  • Removed inheritance from JwstDataModel.  The main reason I did so is because the container isn't really itself a model, and this inheritance was mainly used in the pipeline to allow datamodels.open() to operate on containers.  A small change to stdatamodels makes this unnecessary.  If model metadata is needed, it's usually better/safer to pull this from one of the models in the container rather than the container itself, and indeed that's usually what is done in the pipeline anyway.  This change also allowed removal of everything schema-related in the container.
  • Removed save_open and return_open options in light of ModelLibrary handling memory usage more carefully
  • Added context management.  This was necessary after removing the JWSTDataModel inheritance to make with datamodels.open(asn.json) as container have the expected behavior. The only thing the context management does is close all the models in the container when exiting.
  • Removed the save method.  This is the change I'm least certain about.  The old save method would just save all the models in the container to a filename specified according to the way JWSTDataModel figures out default filenames.  Saving this way is no longer used in the pipeline after ModelLibrary was added, and philosophically it feels like if the container has a save method, it should make an association, since the main reason for its existence is easy manipulation of asn-type data.  If the desired behavior is to save all the models, I see no reason a user can't just loop over the container and do so.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Melanie Clarke on JIRA:

A quick note for discussion later, while we are refactoring ModelContainer...

There are some inconsistencies between how ModelContainer handles relative paths in association file expname values and how stpipe does it.  I think it can be traced back to ModelContainer not respecting the input_directory when it reads in files from an association: instead of using something like stpipe.step.make_input_path, it directly prepends the ASN directory (here, in your proposed refactor branch).

This was reported in an old ticket (JP-2038) and closed as won't-fix, but it would be nice to clean it up while we're working on making ModelContainer more usable and stpipe compatible.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Ned Molter on JIRA:

Melanie Clarke does the new test association file in this commit cover the problem that was being discussed?  And does it look like the new unit test covers that case adequately?  I hope I understood the problem correctly from the original ticket and long back-and-forth on it.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Melanie Clarke on JIRA:

I think not quite.  I am testing like this, with input imaging rate files set up in an 'input' directory, and two empty directories, 'img2' and 'img3', at the same level under the current directory:

    asn_from_list input/*fits -o img2/img2.json -r DMSLevel2bBase
    strun calwebb_image2 img2/img2.json --output_dir=img2

    asn_from_list img2/*cal.fits -o img3/img3.json --product-name img3
    strun calwebb_image3 img3/img3.json --output_dir=img3

The image2 run succeeds: it assumes the starting directory is the current working directory and finds the input at input/*fits, even though the association is in img2.  The image3 run fails, on both main and your branches, because it assumes the starting directory is img3, where the association file is, and looks for the files in img3/img2/*cal.fits.

I was hoping we could just quickly sort out this inconsistency, but it may be too complicated a side issue to be dealt with in this ticket.  On further examination, it doesn't look like either pipeline properly supports the input_directory parameter.

@stscijgbot-jp
Copy link
Collaborator Author

Comment by Ned Molter on JIRA:

Fixed by #8831 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant