Discussion: Versioned RAD #305
Maybe easiest things first: my impression is that the SDF / archive info are guidance to Kim & Jason about how the databases they're building should look. But there is presently a human in that loop; I don't think your changes will break anything there.

Do you have an example with a converter? Do we need to write a converter for each version, i.e., v1 -> v2, v2 -> v3, v3 -> v4, ..., and then when reading an old v1 file the whole chain gets played forward to make a new file? Or something? I saw the ~generic converter in your branch of roman_datamodels but it wasn't clear to me that that was what I was looking for; I will look again if you tell me to.

Mechanically, I think we do need a "does this in memory structure validate against this tag" routine, since we use that a fair amount in the various tests. I imagine you already have that, but the break between tag / in-memory-object makes that connection a little more diffuse.
Should I put you on the agenda for Thursday's romancal tag-up to talk more about this?
I'm still thinking about the contents, but one quick question: when you say no dynamic node creation, do you mean the generation of the classes themselves, but not the static definitions of attributes (those are still checked against the schemas, but not presumed to exist on the creation of a class instance)?
A comment about types of possible changes to schemas. The WCS example is a good one in that I have a hard time imagining that an old file using the old schema version makes any sense with the more modern pipeline. This reflects pipeline and file evolution, and supporting the old files (they will all be prelaunch) is mostly pointless (other than for converting data files used for testing purposes). But then there are examples of added attributes that don't necessarily make the pipeline unusable for old versions of the schemas. These could be items with additional information that improve calibration, for which the pipeline can use a default if not present. It seems, at least for a while, that these would not necessarily be required. Can someone give an example of one that, if not provided, means the pipeline must fail? At least it would be good for us to try to list policies on this sort of thing (e.g., if the pipeline is easily made backward compatible, the attribute should be optional in the schema, and if possible that is strongly encouraged). Doing this ought to make the extension versioning less involved, wouldn't it? For cases where the pipeline cannot use the old version without modifying the contents of the node (e.g., adding wcs), perhaps the better solution is conversion utilities to do that outside of the pipeline.
It may be useful to provide some examples of how cases of changed schemas where the meaning of an attribute changes (e.g., different implied units, or other modifications to the understanding of a value) can be obtained by the pipeline or extension. If such a case exists, can the extension test for the actual tag version, or must there be a different extension? Hopefully such cases won't exist, but if they do showing how they are handled by the pipeline or extension would be helpful. Or should we rely on a data migration tool? ("The pipeline no longer supports this version of the data; run this utility to convert to a supported version")
Thanks for taking a look! I'm happy to discuss this through whatever means is most convenient for those involved (this PR discussion may become a bit of a mess as the discussion continues). @schlafly let me know if it's helpful for me to join the tag-up (and feel free to share the details via slack or email).
Thanks! It would be helpful to include them in discussions about the RAD changes.
Are you referring to something that will migrate nodes from one version to another (and not an asdf converter which handles serialization/deserialization of nodes)? If so, I don't have an example of something that will migrate a node from one version to another. I am currently thinking about something like the following. Let's say a file is opened that contains a `WfiImage`:

```python
m = rdm.open('old_file.asdf')  # m is an ImageModel
m.tag  # currently asdf://stsci.edu/datamodels/roman/tags/wfi_image-1.0.0 because the tag comes from the WfiImage node
```

It is possible to overwrite this tag (this branch no longer stores the tag on the class but instead stores it on the instance):

```python
m._instance._tag = 'asdf://stsci.edu/datamodels/roman/tags/wfi_image-2.0.0'
```

Assuming

To consider a different change:
I am starting to think that changing the tag version as a means of migration might be a thing to consider as a first-pass at data migration.
Good point! With the current changes in this PR there is a link between the nodes and tags (as seen above). The major change is that the link is no longer "one-to-one". A tag for a given node (
Exactly! Validation on assignment still works and the attributes are read from the schema. The key change was the removal of the "factories" used to make the nodes from the schemas.
Thanks for the point!
I'm not sure if this is an example that fits what you're looking for, but looking back at old changes I did find one where `dark_slope` was added to the reference file. Let's assume that
To expand on the dark example, if
Would you elaborate on what you mean by "obtained by the pipeline or extension"? I'm not sure if this is relevant, but in this example: https://github.com/spacetelescope/rad/pull/300/files
That's an interesting idea. Perhaps this is one way the pipeline migration could work. When the file is opened, all tags are set to the newest versions and the model is validated. If validation fails, then report the issue to the user and provide separate migration tools. This should handle all the trivial migrations, and for reference files the user can just download the newest version (or update their crds cache). Data files will likely be the trickiest to handle, although re-running earlier steps (to produce files with newer tags, assuming the earlier steps don't have the same migration issues) might be workable in some cases.

Thanks @perrygreenfield and @schlafly for your comments so far. Please let me know if I missed (or misinterpreted) anything. As a tl;dr: I'm currently thinking that retagging and validating nodes might be the first step towards "migration" and seems compatible with the "versioning" proposal outlined above (but will require updates to the code in this PR to handle the retagging). Also, it sounds like collecting a few more examples would help to guide the discussion.
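To sketch the retag-and-validate idea a bit more concretely (this assumes the instance-level `_tag` from the branch above, and that the branch's converters would write using the instance tag; treat it as illustrative rather than a working recipe):

```python
import io

import asdf
import roman_datamodels as rdm

NEWEST_TAG = "asdf://stsci.edu/datamodels/roman/tags/wfi_image-2.0.0"

model = rdm.open("old_file.asdf")   # the node keeps the (old) tag it was read with
model._instance._tag = NEWEST_TAG   # retag the node instance (per this branch)

buffer = io.BytesIO()
try:
    # asdf validates tagged objects against their schemas on write, so this
    # fails if the retagged node does not satisfy the wfi_image-2.0.0 schema.
    asdf.AsdfFile({"roman": model._instance}).write_to(buffer)
except asdf.ValidationError as err:  # exception class may differ between asdf versions
    # Not a trivial migration: report the problem and point the user at a
    # separate migration tool (or a newer reference file / re-processed input).
    print(f"cannot migrate by retagging alone: {err}")
```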
Regarding my comment on "obtained by the pipeline or extension": what I was wondering about was, if an older versioned file was opened by the pipeline (I guess by an older version of an extension, which produced a node that is incomplete compared to what the current version expects), how does the pipeline know that it was produced by an older tag? If the tag is no longer associated with the node other than as a wild-carded tag, is there a way for the pipeline to know that it was generated from an older version, and for which it knows that it needs to add more information? Or is this handled by the new extension that recognizes the older tag and uses the version in that process to make needed changes? I guess I'm not entirely sure who fixes what. (And I tend to favor explicit migration tools. These can be part of the archive so that files are updated when the schemas are changed.) But you can see I haven't thought this all through :-)
Thanks for the clarification! For what I'm proposing here, when an old file is opened, nodes would be created that comply with the older schema and have older versions of tags. For the changes described here, nodes still have specific tags. The key change is that the tag is associated with the node instance instead of the node class (so each node instance might have a different tag version).

Using the wcs example, if we have an "old_file.asdf" that was written with the 1.0.0 extension and a "new_file.asdf" written using the 2.0.0 extension, opening the files will produce nodes that have the read tags:

```python
old_model = rdm.open("old_file.asdf")
assert isinstance(old_model, ImageModel)
assert "wfi_image-1.0.0" in old_model.tag

new_model = rdm.open("new_file.asdf")
assert new_model.__class__ == old_model.__class__  # both are ImageModel
assert "wfi_image-2.0.0" in new_model.tag
assert new_model.tag != old_model.tag  # however the read tags are different versions
```

So in this case the extension/
Perhaps
That clarification was very helpful.
Perry has a ~suggestion that we should make most new metadata optional in order to avoid schema churn. That has not been our policy so far; often we define things as optional initially during development so as not to break things and then change them to required once we have updated the files. This has the nice feature that we cannot later "lose" these new bits of metadata that we intend to propagate, and that when we say that we have a file that validates, it really has all of the metadata that we want.

If we change the criterion for 'required' to be 'does the pipeline need this to run', most of the metadata will become optional, and I think that's not really what we want. On the other hand, it would dramatically reduce churn and would make a clear separation between the things that we are just passing through and filling with empty sentinel values and the things that we are actually paying attention to, which would be valuable.
For ease of reading, I will group my responses by library.

RCAL: As was discussed at last week's RomanCAL tag-up, RCAL should focus solely on the current versions of the schemas, not on previous ones. Supporting previous versions opens a Pandora's box of multiple algorithms and issues with cross-versioned schemas. We should focus on version support solely in RAD & RDM, primarily for users to be able to interact with files they have downloaded that may "age out," version-wise. With regards to RCAL, I think the most we should consider is pinning minimum versions of schemas for validity with the current version of RCAL.
RAD: Regarding sdf & archive data:
Regarding the dev manifest - what is the advantage of doing this versus working in an increased version number, which gets frozen at library release? Otherwise, there isn't much in RAD to comment on.
RDM: This is the big one, where the major headaches lie. Given its present modular design (model, maker utilities, etc. are all built from schemas), I am wondering if this modularity could be expanded to support whatever schema version RAD has, building the end-result product on the fly. Given that future versions tend to be a superset of previous versions, this could be feasible, with a small set of exceptions for type / size changes. I think this will be the least painful and most flexible route forward, rather than making models and maker_utilities for every version of every schema.

By default, I think that RDM should write models to the same versions they were originally, instead of trying to force an upgrade to current. We could offer the option to do this, utilizing the dummy values from the maker utilities. If a user wants to "upgrade" their file, they could create something that validates, but may or may not work with the pipeline without them replacing the dummy value objects with proper values. The work here needs to be carefully considered, as it could telescope easily.
Thanks for taking a look and the detailed comments.
One thing to consider here is how regression tests will work with "real" data produced by the production pipeline. Regression tests that use "real" data will require updating these input files as tags increase versions. It might not always be possible to regenerate the file from its inputs (unless the inputs are also available). Updating the files manually seems doable, as the regression tests can be used to make sure the file still serves its purpose in the test (and that the update did not introduce a new issue). As @eslavich brought up in some discussions, it seems likely that migrating regression test files will be (and perhaps is now) an issue; it may help to consider whether tooling can be developed to make this easier.
The main advantage I see here is that the 'dev' portion of the version provides an indication that the version is not stable. Using a non-dev version (
These are all excellent points. As the |
Having the ability to regression test outdated versions of files isn't terribly helpful, as the pipeline is intended to support the latest versions only. Accordingly, any "real" data regression files that cannot be updated would need to be replaced with new "real" data regression files.
This PR is to discuss options for versioning rad and roman_datamodels.
I did not go into all the details to keep things "brief" with the hope that this would help to continue the discussion.
Goal
The main goal is to allow older files to be opened (with other goals/considerations discussed below).
Example: adding "wcs" and opening existing files
It may be helpful to discuss a specific example of a change that at the moment breaks opening (many) older files: adding wcs to wfi_image. The change involves adding `wcs` as a required property for the `schemas/wfi_image-1.0.0` schema. This schema is used to validate objects tagged with `tags/wfi_image-1.0.0` (as defined in the manifest). All files that contained a `WfiImage` at the time of writing will contain the tag `tags/wfi_image-1.0.0` (written to the ASDF file along with the serialized representation of `ImageModel`). As these old files were written before `wcs` was required, they will not contain a `wcs` and will fail schema validation (against the updated `schemas/wfi_image-1.0.0`) when the file is read and the contents are validated. To summarize, the addition of `wcs` as a required attribute makes all `ImageModel`-containing files unreadable.

ASDF extension versioning
The problem of how to update a schema while still being able to open old files might be solved by using asdf's ability to version extensions. To be able to open old files, every schema change should result in a new schema version and a new schema file, and, importantly, leave the old schema version (and file) intact (see below for a note on how to handle "development" of schemas, where this can be more flexible).
For the wcs example, adding `wcs` as a required property could be done while allowing for older files to be opened if:

- the `wcs` property was added to a new schema (`schemas/wfi_image-2.0.0`)
- a new tag was added that references the new schema (`tags/wfi-image-2.0.0`)
- the new tag/schema pair was added to a new manifest/extension, `datamodels-2.0.0` (note that this is not compatible with the current `roman_datamodels` due to issues discussed below)

To play through the steps performed when asdf tries to open the old file (assuming the above schema changes were made):
- asdf finds the old tag `tags/wfi-image-1.0.0` and uses the old `schemas/wfi_image-1.0.0` to validate the serialized contents
- asdf selects the matching converter (the `Converter` provided by `roman_datamodels`) which produces the appropriate `node`

Now that the `node` is in memory, if the file is rewritten asdf will:

- use the `Converter` to serialize the `node` using the new `tags/wfi-image-2.0.0`
- validate the serialized contents against `schemas/wfi_image-2.0.0` (which will only pass if a `wcs` was added)

(note that in the above example the `node` was created using one tag `tags/wfi-image-1.0.0` and written using a different tag `tags/wfi-image-2.0.0`, more on that below)

This is often how asdf extensions are written, where during read/deserialization the tag is lost and on write/serialization the newest tag is used.
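For example, a plain asdf round trip of an old file would then look roughly like this (schematic, assuming the updated extension is registered and the usual "roman" top-level key):

```python
import asdf

# Read: asdf matches tags/wfi-image-1.0.0 and validates against the old schema.
with asdf.open("old_file.asdf") as af:
    node = af["roman"]
    # node.wcs = ...  # the old node has no wcs; one must be supplied before writing
    # Write: asdf re-serializes with the newest registered tag (wfi-image-2.0.0)
    # and validates against the new schema, which requires a wcs.
    af.write_to("rewritten_file.asdf")
```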
roman_datamodels and linking specific tags to nodes
`roman_datamodels` currently links in-memory `nodes` with specific versions of tags (and schemas). As seen in the docs, `WfiImage` is linked specifically to `tags/wfi_image-1.0.0`. There is currently no precedent for how to support a newer version of this tag (i.e. `tags/wfi_image-2.0.0`), which is currently preventing versioning the schemas/tags. One option (discussed here) would be to break this static link between tags and nodes. For our example `WfiImage` would no longer be specifically linked to `tags/wfi_image-1.0.0` and would instead use the tag pattern `tags/wfi_image-*` to denote that any `wfi_image` tag can be associated with `WfiImage`.

This tag_pattern matching is used extensively in asdf. It is common to have converters handle all supported tags that match a pattern (see the Table converter in asdf_astropy) and to use the extension manifest to describe a set of self-consistent tags (again see asdf_astropy). Please let me know if it's helpful to go into more details here, as the exact process asdf follows for matching tags and extensions can be a bit complicated. For now I will skip the details and summarize the process as:
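As an illustration of such pattern-matching converters, here is a minimal sketch following the asdf `Converter` API (the class body is illustrative, not the actual roman_datamodels implementation):

```python
from asdf.extension import Converter


class WfiImageConverter(Converter):
    # A tag pattern instead of one pinned version: this converter is used for
    # every wfi_image tag that the registered manifests provide.
    tags = ["asdf://stsci.edu/datamodels/roman/tags/wfi_image-*"]
    types = ["roman_datamodels.stnode.WfiImage"]

    def select_tag(self, obj, tags, ctx):
        # Sketch: pick the tag stored on the instance (per this PR) rather
        # than always selecting the newest supported tag.
        return obj._tag

    def to_yaml_tree(self, obj, tag, ctx):
        # Sketch: return the node's underlying data; asdf then validates it
        # against the schema the manifest associates with `tag`.
        return dict(obj)

    def from_yaml_tree(self, node, tag, ctx):
        # `tag` carries the specific version read from the file, so the
        # in-memory node can remember which version it was read with.
        from roman_datamodels.stnode import WfiImage

        obj = WfiImage(node)
        obj._tag = tag  # instance-level tag, as proposed in this PR
        return obj
```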
Proposal to make tag/node links using tag_patterns
I am proposing that rad/roman_datamodels adopt this strategy for handling versions and supporting old files. Some specific changes that will be required are:
Issues related to data migration (converting a node read from `1.0.0` to write to `2.0.0`)

The process detailed here has not yet addressed the problem of how to migrate data from an old to a new version. Using the "wcs" example, if a `WfiImage` node is loaded using `wfi_image-1.0.0` it will not contain (nor require) a `wcs`. This will mean that on write, if an attempt is made to write using the `wfi_image-2.0.0` tag (which requires a `wcs`) the write will fail because the serialized `WfiImage` will not validate using the `wfi_image-2.0.0` schema. Any user who opens a file and makes an otherwise compatible change is left with an object they cannot save. I do not yet have what I feel is a reasonable proposal for how to deal with this (nor do I feel that I've thought through possible cases where this issue will arise). To propose something, we could:

I can see benefits to having migrations in either romancal or roman_datamodels:
I don't see data migration as a new problem, but giving it some thought and discussion may help to inform whether the versioning strategy detailed here is worth pursuing. For the "wcs" example (please correct my mistakes), I believe the changes were only for later stage (level 2/3) files where the "wcs" can be computed from the input. For production of new level 2 (or level 3) images, the new node/model would use the new tag and a "wcs" would be expected. For steps that accept a level 2 (or 3) file as input, those steps would need to be updated to deal with input that does not contain a "wcs" (perhaps requiring version 2.0.0 files). As I don't expect a "wcs" could be filled in for these files, they would never be suitable for processing. So in this case no migration would be defined and instead an attempt to migrate 1.0.0 to 2.0.0 should fail (I expect this will also be true for some file types in general, like reference files).
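If explicit migration tools end up being the answer for other kinds of changes, one possible shape (purely hypothetical, nothing like this exists yet) is a registry of per-version migration steps, where an intentionally missing entry, as in the wcs case above, simply refuses to migrate:

```python
# Hypothetical migration registry: each entry upgrades an in-memory tree by one
# tag version; chaining entries plays an old file forward to the newest version.
MIGRATIONS = {
    # ("wfi_image", "1.0.0", "2.0.0") is deliberately absent: a wcs cannot be
    # synthesized for already-written files, so this migration should fail.
}


def migrate(tree, name, old_version, new_version):
    """Apply the registered migration for one version step, or fail loudly."""
    try:
        step = MIGRATIONS[(name, old_version, new_version)]
    except KeyError:
        raise ValueError(
            f"{name}-{old_version} cannot be migrated to {new_version}; "
            "re-run earlier pipeline steps or obtain a newer file"
        )
    return step(tree)
```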
Using a dev manifest to ease development
As mentioned above, how can we easily develop schemas if every change requires a new tag, schema, manifest, extension, etc.? One approach would be to designate 'stable' versions and an 'unstable/dev' version. Breaking changes could be made to 'dev' and only the integrity and support of the 'stable' versions would be of importance. For our example, `wcs` would be added to `wfi_image-2.0.0` which would be registered with asdf using extension `datamodels-2.0.0-dev` (leaving `datamodels-1.0.0`, which contains `wfi_image-1.0.0`, unchanged). This would allow old files to be read. On write, a decision would need to be made about using `wfi_image-2.0.0` (the tag in the 'dev' manifest) or `wfi_image-1.0.0` for the `WfiImage` node. By disconnecting the in-memory nodes and tags (as discussed above), the selection of tags can be controlled by which extensions are registered with asdf (if no extension uses the 'dev' manifest, no 'dev' tags will be used). There are several situations that should be considered for when to use 'dev' tags:

I am proposing the following process (a rough sketch of the switch follows the list):
- `RAD_DEV`, which when set triggers use of the development schemas through registering the 'dev' manifest with `roman_datamodels` and `asdf`. The default for this is discussed below.
- When a development version of `rad` is detected, set the `RAD_DEV` default to `True`. This should allow local development and regtests to use the 'dev' schemas by default. However, for pypi users and the production pipeline, no 'dev' tags would be used.
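A rough sketch of that switch (the helper and manifest identifiers are illustrative, not an existing roman_datamodels API):

```python
import os

# Illustrative manifest identifiers; the real URIs live in rad's manifest files.
STABLE_MANIFESTS = ["datamodels-1.0.0"]
DEV_MANIFESTS = ["datamodels-2.0.0-dev"]


def manifests_to_register(default_dev=False):
    """Return the manifests roman_datamodels would register with asdf.

    ``default_dev`` would be True when a development install of rad is
    detected; the RAD_DEV environment variable overrides it either way.
    """
    use_dev = os.environ.get("RAD_DEV", str(default_dev)).lower() in ("1", "true")
    return STABLE_MANIFESTS + (DEV_MANIFESTS if use_dev else [])
```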
wcs
will not also require updates to regtest truth files). Making sure the pipeline handles old files might also require more regtests that use older files from earlier stable versions.How this impacts releases
The next important item to discuss is how to release a new 'stable' version. Using the above approach, it should be possible to:

- rename the 'dev' manifest to a stable one (`datamodels-2.0.0-dev` to `datamodels-2.0.0`)
- turn `RAD_DEV` off (as the truth files should already be up-to-date with the dev tags, no failures should be seen by using the new 'stable' tags)

Branches to discuss
In an effort to explore this option I made the following branches:
The rad changes are rather minor (add a new manifest, slight change to version of the existing manifest).
However, the `roman_datamodels` changes are more extensive. One major change (that may not be necessary in the long run) was the removal of all dynamic node creation (which was previously based off the schemas). It seems likely that this can be re-added (if that feature is of importance); however, I found it helpful to remove the dynamic creation to better understand how nodes were mapped to tags (to schemas, to models, etc.).

At the moment I don't believe any branch is ready for an open PR. However, I wanted to solicit feedback on this before taking it any further, as there are many considerations described above and I'm sure many that I've missed.
To check for major downstream changes I did run a set of romancal regtests which passed without issue (these were run prior to rebasing to main with the wcs changes): https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FRoman-Developers-Pull-Requests/detail/Roman-Developers-Pull-Requests/561/pipeline/247
sdf and archive data
I must admit I do not know enough about the `sdf` and `archive_catalog` schema entries, like these, to know how the above changes would impact other users of rad. Some specific questions I have are:

Help here (and on everything discussed) is greatly appreciated!