Improve performance of `DataModel` reading for single (or simple) metadata access #284

braingram · 2024-03-14T21:29:45Z

There are use cases (like the jwst resample step) where loading a single keyword from many DataModel containing fits files may be useful. As the number of files might be very large and opening every model might exceed reasonable amounts of RAM it will be important to have a performant way to perform these simple keyword accesses.

Using meta.wcsinfo.s_region (contained in the ASDF extension) as an example there are a few ways this keyword can be read:

Using stdatamodels.jwst.datamodels.open:

with dm.open(fn) as m:
    return m.meta.wcsinfo.s_region

Using stdatamodels.asdf_in_fits:

with asdf_in_fits.open(fn) as af:
    return af['meta']['wcsinfo']['s_region']

Using astropy.io.fits and asdf.open:

with fits.open(fn) as ff:
    bs = io.BytesIO(ff['ASDF'].data.tobytes())
    with asdf.open(bs) as af:
        return af['meta']['wcsinfo']['s_region']

Using astropy.io.fits and asdf.util.load_yaml:

with fits.open(fn) as ff:
    bs = io.BytesIO(ff['ASDF'].data.tobytes())
    tree = asdf.util.load_yaml(bs)
    return tree['meta']['wcsinfo']['s_region']

Of the 4 options above 1-3 are similar in performance (using both a ImageModel and a larger IFUImageModel as test files). With performance being limited primarily by asdf.open (more on that below). 4 is much faster in both cases. See the below table for performance (run with cProfile so slightly slower than real).

	ImageModel	IFUImageModel
dm.open	0.245	3.945
asdf_in_fits	0.137	3.815
asdf.open	0.143	3.820
load_yaml	0.036	0.332

The table shows it's ~10x faster to use load_yaml as this skips:

validation
mapping fits based on the schema
loading arrays
conversion from asdf flavored yaml to custom python objects

Although all of the above is public API it seems worthwhile to investigate wrapping 4 as a helper function in stdatamodels (with sufficient documentation about how this is dangerous).

Below is a snakeviz generated graph of the call to dm.open for the IFUImageModel data file showing the bulk of the time spent in asdf.open:

The text was updated successfully, but these errors were encountered:

nden · 2024-03-15T12:38:04Z

Although all of the above is public API it seems worthwhile to investigate wrapping 4 as a helper function in stdatamodels (with sufficient documentation about how this is dangerous).

I agree having this function would be very useful. What dangers do you foresee?

braingram · 2024-03-15T13:19:46Z

Thanks for taking a look at this issue. Most of the dangers are that it bypasses all of the stdatamodels and asdf machinery. This means:

no validation of the tree (although this was validated on write)
no checking that the tree matches the fits data (even via the FITS_HASH, so if a user modifies a fits header the modification will not be seen with load_yaml)
no filling of defaults or reference resolution (although I'm not sure either of these are used)
no conversion from tagged to custom objects (so many objects, like arrays, will come through as dictionaries)
no mapping of arrays to fits data

I think for the s_region use case all of the above are ok. Are there other places in jwst where this might be useful?

perrygreenfield · 2024-03-15T16:27:53Z

If one were concerned with validation, the option could be to do it the current way (though I would not make that the option for operations since the file should have been validated, and no sneaky, invalid updates should have been done). But I agree that an easy-to-use version of 4) should be provided. I can see mining of the files meta data is extremely useful, particularly out of operations and should be as efficient as possible.

braingram · 2024-03-28T19:37:37Z

ModelContainer in jwst also provides a models_grouped method that groups models based on a small set of metadata keywords:
https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/container.py#L466
These keywords are all strings in the primary fits header. Related to this issue, it would be useful to have an efficient way to read these keywords without incurring the overhead of traversing the schema. These attributes are duplicated in the ASDF extension, however the values in the fits header should be preferred (as is already done for datamodels.open).

It would be great if the efficient metadata access code could also load multiple fits keywords.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of `DataModel` reading for single (or simple) metadata access #284

Improve performance of `DataModel` reading for single (or simple) metadata access #284

braingram commented Mar 14, 2024 •

edited

Loading

nden commented Mar 15, 2024

braingram commented Mar 15, 2024 •

edited

Loading

perrygreenfield commented Mar 15, 2024

braingram commented Mar 28, 2024

Improve performance of DataModel reading for single (or simple) metadata access #284

Improve performance of DataModel reading for single (or simple) metadata access #284

Comments

braingram commented Mar 14, 2024 • edited Loading

nden commented Mar 15, 2024

braingram commented Mar 15, 2024 • edited Loading

perrygreenfield commented Mar 15, 2024

braingram commented Mar 28, 2024

Improve performance of `DataModel` reading for single (or simple) metadata access #284

Improve performance of `DataModel` reading for single (or simple) metadata access #284

braingram commented Mar 14, 2024 •

edited

Loading

braingram commented Mar 15, 2024 •

edited

Loading