Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul models so that "unfinished" metadata can be represented without cheating Pydantic #205

Open
jwodder opened this issue Nov 22, 2023 · 22 comments

Comments

@jwodder
Copy link
Member

jwodder commented Nov 22, 2023

Note: @satra filed a spiritually-identical issue at almost the same time: #204


(This is an accumulation of various things discussed elsewhere which need to be written down.)

Currently, it is possible for users of the Dandi Archive API to submit asset & Dandiset metadata that does not fully conform to the relevant dandischema model, and the Archive will accept, store, and return such flawed metadata, largely via use of DandiBaseModel.unvalidated (being replaced by Pydantic's construct in #203). I believe part of the motivation for this is so that web users can fill in metadata over multiple sessions without having to fill in every field in a single sitting.

This results in API requests for asset & Dandiset metadata sometimes returning values that do not validate under dandischema's models; in particular, if a user of dandi-cli's Python API calls get_metadata() instead of get_raw_metadata(), the call may fail because our API returned metadata that doesn't conform to our own models (See dandi/dandi-cli#1205 and
dandi/dandi-cli#1363).

The dandischema models should therefore be overhauled as follows:

  • There should exist models for representing Dandiset & asset metadata in a draft/unfinished state. These models should accept all inputs that we want to accept from users (both via the API and the web UI), store in the database, and return in API responses. (It is likely that such models will have all of their fields marked optional aside from the absolute bare minimum required.)

    • The get_metadata() methods of dandi-cli's Python API should return instances of these models.
  • There should exist functionality for determining whether an instance of a draft/unfinished model meets all of the requirements for Dandiset publication.

    • One possible way to implement this would be to have separate models for published metadata that inherit from the draft models and make various fields non-optional.

CC @satra @dandi/dandiarchive @dandi/dandi-cli

@yarikoptic
Copy link
Member

I thought to argue that it would add another combinatorial dimension, but well, we kinda already have it half way that in terms of separation of "Common" vs "Publishable"

dandischema/models.py:class CommonModel(DandiBaseModel):
dandischema/models.py:class Dandiset(CommonModel):
dandischema/models.py:class BareAsset(CommonModel):
dandischema/models.py:class Asset(BareAsset):
dandischema/models.py:class Publishable(DandiBaseModel):
dandischema/models.py:class PublishedDandiset(Dandiset, Publishable):
dandischema/models.py:class PublishedAsset(Asset, Publishable):

the question then is really to "shift" validation into something like .is_publishable() and may be indeed mirroring as PublishableAsset/PublishableDandiset counterparts with more restrictive model?

@sneakers-the-rat
Copy link

butting in here with uninvited 2 cents: to avoid needing to maintain a whole parallel set of models, you could use a field validator that indicates if validations fail by modifying a status field that indicates whether it is publishable or not.

like

@field_validator('*', mode='wrap')
@classmethod
def allow_optional[T](
    cls, 
    v: T, 
    handler: ValidatorFunctionWrapHandler, 
    info: ValidationInfo
) -> T:
    try:
        return handler(v)
    except ValidationError as e:
        # do something to indicate that we failed validation but still allow class instantiation, eg. 
        info.data['validation_errors'].append(e)
        return v

@model_validator(mode='after')
def validation_state(self):
    # do something here to check if we had validation errors that forbid us from being publishable

You could also dynamically control whether one should be validating publishable or draft standards using validator context like eg: https://docs.pydantic.dev/latest/concepts/validators/#using-validation-context-with-basemodel-initialization

@jwodder
Copy link
Member Author

jwodder commented Jan 31, 2024

@yarikoptic I don't think I'm the right person to assign this to (at least not at this time). The task requires someone (or discussions with several someones) who knows what the "minimal metadata" requirements are, what the "completed metadata" requirements are, and what's allowed for the in-between.

@yarikoptic
Copy link
Member

gotcha -- thanks. Let us dwell on this aspect -- may be during upcoming meetup with @satra at all.

@yarikoptic
Copy link
Member

I have been thinking about this but still not sure if we ever would be able to easily guarantee that even published Dandiset's metadata confirms "The Model". The reason is the same what haunts NWB (and PyNWB in particular ATM): model versions and the fact that we can and do break backward compatibility (e.g. #235 would make possibly legit prior models now invalid).

So, unless we can come up with a way to have a model "Valid" according to a specific version (which we cannot at pydantic level since we have only 1 "current" model version), we cannot guarantee that past "Valid" models remain valid currently.

@sneakers-the-rat
Copy link

sneakers-the-rat commented Apr 16, 2024

Seems like we would need model migration here, and that might be a good thing to have generally, no schema stays still ;).

I interject again here bc i have been wanting to do something similar and think it might be a nice little pydantic extension - Decorator/module level const gives model a particular version, each upgrade would need to provide a migration patch that can do up and down. When validating, run through each migration step and then report which versions failed - eg. if an optional param became required, the up method would check for that and fail otherwise.

I think that might be nicer than maintaining full copies of every version without clean ways to migrate between them. pydantic should allow model instantiation through validation errors in any case, and the migration methods would get us 'best effort' upgrades (eg. be able to perform renames and etc. but not magically fill in missing required values).

If the plan is to move to linkml eventually, this would be something i would be happy to work on with y'all in the pydantic generator, i have been doing bigg refactor work there and am in the process of finishing a patch to include all schema metadata, so if we want to make some helper code to do diffs between schema versions and use that to generate model migration code i would be super into working on.

edit: fwiw i have this mostly done for NWB in linkml by being able to pull versions of the schema and generate models on the fly - eg see the git provider and the schema provider where one just does PydanticProvider().get('core', '2.3.0') and gets back a module of models without needing to install a different version of the tool. that was written before i started refactoring linkml's pydanticgen so a lot of the bad parts in there can be replaced.

@satra
Copy link
Member

satra commented Apr 25, 2024

thanks @sneakers-the-rat - i think this may align well with the transition to linkml. i know @candleindark is working on an updated reverse linkml generator from the current pydantic models in dandischema. i suspect this will be a priority for us to redo the models using linkml in the upcoming month. we will also want to separate out schema validation into two components: 1) are appropriate values being assigned and 2) are required values being assigned. requirement is a project specific thing and hence this will also allow us to reuse schemas. also 2 allows us to further stratify requirement given the state of an asset (pre-upload, uploaded, modified, published) or dandiset.

we need a specific project timeline/design for this transition.

@sneakers-the-rat
Copy link

Lmk how I can help - happy to hack on the pydantic generator to make it work to fit yalls needs bc working with DANDI + NWB in linkml is exactly within my short term plans :)

@candleindark
Copy link
Member

@sneakers-the-rat My current plan is to improvement Pydantic to linkml generator as I participate in the transition to linkml in dandischema.

One approach is to build something that mimics the behavior of GenerateJsonSchema but instead of generating JSON schema, linkml schema will be generated. If this approach works, the parsing of the Pydantic models are already taken care of by the Pydantic library.

@sneakers-the-rat
Copy link

That would be great! There is already a sort of odd LinkMLGenerator but generalizing that out to accept pydantic models/json schema from them would be a very useful think for initial model import. LinkML definitely wants to work with linkml schema being the primary source of truth, aka do the pydantic model -> linkml schema conversion and from then on use the pydantic models generated from the linkml schema, but I have been working on the generator to make it easier to customize for specific needs eg. If yall want to separate some validation logic in a special way that the default generator doesnt do.

I overhauled the templating system recently to that end, see: https://linkml.io/linkml/generators/pydantic.html#templates
And https://linkml.io/linkml/generators/pydantic.html#cmdoption-gen-pydantic-template-dir
Programmatic example: https://github.com/linkml/linkml/blob/bd989b3a8d7fd336532a35a4019fca8877e915d6/tests/test_generators/test_pydanticgen.py#L956
CLI example: https://github.com/linkml/linkml/blob/bd989b3a8d7fd336532a35a4019fca8877e915d6/tests/test_scripts/test_gen_pydantic.py#L8

And im also working on making pydantic models fully invertible to linkml schema, so you could go DANDI models -> linkml schema -> linkml pydantic models -> customizef linkml DANDI models and then be able to generate the schema in reverse from those, but it might be more cumbersome to maintain than just customizing the templates and having the schema be the source of truth. See: linkml/linkml#2036

that way you can also do stuff that cant be supported in pydantic like all the RDF stuff (but im working on that next ) ;)

@sneakers-the-rat
Copy link

(Maybe we should make a separate issue for linkml discussions, sorry to derail this one)

@candleindark
Copy link
Member

butting in here with uninvited 2 cents: to avoid needing to maintain a whole parallel set of models, you could use a field validator that indicates if validations fail by modifying a status field that indicates whether it is publishable or not.

like

@field_validator('*', mode='wrap')
@classmethod
def allow_optional[T](
    cls, 
    v: T, 
    handler: ValidatorFunctionWrapHandler, 
    info: ValidationInfo
) -> T:
    try:
        return handler(v)
    except ValidationError as e:
        # do something to indicate that we failed validation but still allow class instantiation, eg. 
        info.data['validation_errors'].append(e)
        return v

@model_validator(mode='after')
def validation_state(self):
    # do something here to check if we had validation errors that forbid us from being publishable

You could also dynamically control whether one should be validating publishable or draft standards using validator context like eg: https://docs.pydantic.dev/latest/concepts/validators/#using-validation-context-with-basemodel-initialization

@sneakers-the-rat Thanks for the input. I think this is a solution to make field optional when receiving the user inputs while keeping the fields required at publication. However, one can't generate, at least not directly, two schema variants (one with the fields optional and the other with the fields required). It would be nice if we can the two different schema variants. Is SchemaView a potential solution to getting two schema variants?

@sneakers-the-rat
Copy link

Do you mean at a field level, being able to label a given slot as "toggle required" so at generation time you get two pydantic modules, one with those as required and one as optional? Or do you mean at a generator level, make two pydantic modules, one where all slots are required and one where all are optional?

Im assuming the former, where you want to make a subset of slots that are annotated as being part of a publishable criterion, and some computed slot is True if all those are met? I still am sorta thinking that generating two sets of modules would be a pain to maintain and use, and would personally lean towards a single set of models that can indicate what level of compliance the data is vs. Having several sets of nearly-identical models.

Another approach if ya still want to make multiple sets of models might be to do something like have one base model thats the most lax, then have a Publishable model inherit from that and use slot_usage to change the requiredness of some slots, and then you have a container class with a dataset slot with a any_of: [{range: DraftDataset}, {range: PublishableDataset}] and then pydantic validators would cast to the correct type based on what was present (probably would want to order them from most to least restrictive, but you know what I mean). That still sounds sort of hard to maintain and requires some additional container class, but might be a middle ground between one set of models and totally independent sets of models

@sneakers-the-rat
Copy link

Sorry to answer your question, I am betting that we could rig something up to generate different models on a switch. That would probably be easiest to do by making a step before schemaview where you load the schema, introspect on it to flip requireness depending on a flag, and then send that to schemaview. SV is sorta badly in need of a clarifying refactor bc at the moment its a bit of a "there be dragons" class (in my loving opinion, which is based on appreciating the winding road that led to SV)

@candleindark
Copy link
Member

Do you mean at a field level, being able to label a given slot as "toggle required" so at generation time you get two pydantic modules, one with those as required and one as optional? Or do you mean at a generator level, make two pydantic modules, one where all slots are required and one where all are optional?

I am not clear about this, but this can be that I have never tried generating two modules from one schema.

Sorry to answer your question, I am betting that we could rig something up to generate different models on a switch. That would probably be easiest to do by making a step before schemaview where you load the schema, introspect on it to flip requireness depending on a flag, and then send that to schemaview. SV is sorta badly in need of a clarifying refactor bc at the moment its a bit of a "there be dragons" class (in my loving opinion, which is based on appreciating the winding road that led to SV)

Yes, this is close to what I had in mind. Have one LinkML schema, the base schema, as the source of truth. From the base schema, we generate variants from it by toggling the required slots for example. Have all these variants published as a documentation as well so that users can see what are needed at different stages of their metadata. From the different variants of the LinkML schema, we can generate different Python modules containing the Pydantic models. The name of a Pydantic model will stay the same across the different Python modules.

@sneakers-the-rat
Copy link

sneakers-the-rat commented May 17, 2024

for that, probably the easiest way would be to use subsets, like

subsets:
  DraftSubset:
    rank: 0
    title: Draft Subset
    description: Set of slots that must be present for draft datasets
  PublishableSubset:
    rank: 1
    title: Publishable Subset
    description: Set of slots that must be present for publishable datasets
    
slots:
  optional_slot:
    required: false
    description: Only required for publishable datasets
    in_subset:
      - PublishableSubset
  required_slot:
    required: true
    description: Required on all datasets
    in_subset:
      - DraftSubset
      - PublishableSubset

and then you could either iter through slots and check the subsets they are in or use the get_subset method on SchemaView, modify the requiredness accordingly, then pass back to a new sv instance.

I still think that you can do this with one set of models! I am just imagining it being difficult to parse and also to write code for "ok now i import the DraftDataset model but when it comes time to publish i'll need to use PublishableDataset" but ultimately your call. Slight semantic difference between "these fields are optional for drafts but required for publishable datasets" to "these fields are optional, but the dataset is not publishable if they are absent."

we would just need to improve the pydantic generator to support equals_expression (which should happen anyway).

So then you would do something like

classes:
  Dataset
    attributes:
      attr1: "..."
      attr2: "..."
      publishable:
        equals_expression: "{attr1} and {attr2}"

which would make a model like

class Dataset(BaseModel):
    attr1: str
    attr2: str
    
    @computed_field
    def publishable(self) -> bool:
        return (self.attr1 is not None) and (self.attr2 is not None)

or we could combine the approaches and make an extension to the metamodel so we get the best of both - clear metadata on the slots and also single model.

# ...
equals_expression: "all(x, PublishableSubset, x)"

where the first x is the logical condition (in this case, just being truthy) and the second x is the variable to use to refer to the items in the subset in the logical condition (so eg you could also do all(x > 5, PublishableSubset, x)

I think this is probably a common enough need that it would be worth making a way to express this neatly in linkml, so we can probably make the metamodel/generators come to you as well as y'all hacking around the metamodel/generators :).


So to keep a running summary of possible implementations:

  • multiple schemas
  • use rules
  • use equals_expression
  • modify the pydanticgen templates
  • subsets with some hackery at generation time
  • make some notion of 'context' when generating models that flips behavior
  • multiple classes + container class that switches between class depending on presence/absence of values

@candleindark
Copy link
Member

candleindark commented Jan 23, 2025

I want to know more about why we are currently accepting unvalidated metadata to build Dandischema model objects? Is it, as @jwodder mentioned in the opening post, to allow web users to fill in metadata over multiple sessions?

If the answer is yes, for the second question. There may be a simple solution.

For each Pydantic model, we can generate two JSON schemas, one, Schema A, that is consistent with the Pydantic models (this one for public reference) and the other, Schema B, with all the properties of the model changed to optional (for internal use only). We can use Schema B to build the web UI for user input that allows submitting incomplete metadata. At the server, before it is requested to be finalized, the submitted metadata is not validated against any Pydantic model and is treated as a dictionary that is validated against Schema B at the server and stored as a JSON object. A user can request the submitted metadata to be finalized using the WebUI or the client. When the server receives such a request, the server uses the corresponding Pydantic model to validate the submitted metadata, i.e. to finalize it.

Using the above scheme has the following benefits:

  1. Separated the requirements for the assigned values from the requirements of what values need to be assigned. (Schema B retains the specifications for each property though all properties are marked as optional). This address of the issue mentioned by @satra at Overhaul models so that "unfinished" metadata can be represented without cheating Pydantic #205 (comment))
  2. We don't need to keep two different Pydantic models for each data model. (The second generated JSON schema for each model is for internal use only)
  3. There is no confusion about which instances of metadata must conform to a Pydantic model (the finalized ones).

I don't know how this solution can address #205 (comment) brought up by @yarikoptic. However, I think that can be addressed separately.

@jwodder
Copy link
Member Author

jwodder commented Jan 23, 2025

@candleindark

Schema B, with all the properties of the model changed to optional

Is making everything optional the correct approach? What about foundational fields like id and schemaKey? Going the other way, I think there are also some string fields on which we enforce required formats; do we want those relaxed for Schema B?

We don't need to keep two different Pydantic models for each data model. (The second generated JSON schema for each model is for internal use only)

I disagree on this point. The API can still return "schema B metadata" for objects in unpublished versions, and providing a structured, typed representation is desirable. Otherwise, the vast majority of DANDI metadata can only be manipulated via raw, unstructured dicts like a JavaScript programmer.

@candleindark
Copy link
Member

candleindark commented Jan 24, 2025

@candleindark

Schema B, with all the properties of the model changed to optional

Is making everything optional the correct approach? What about foundational fields like id and schemaKey?

If there are fields that are required even for unfinalized (unfinished) metadata instances, making every fields optional in Schema B is not the correct approach. However, we can selectively keep those required fields as required in generation of Schema B.

Going the other way, I think there are also some string fields on which we enforce required formats; do we want those relaxed for Schema B?

If the format requirements are enforced in the JSON schema level (i.e. specified in Schema A), they should be available in Schema B as well. The only differences between Schema A and B are the differences in what fields are required.

We don't need to keep two different Pydantic models for each data model. (The second generated JSON schema for each model is for internal use only)

I disagree on this point. The API can still return "schema B metadata" for objects in unpublished versions, and providing a structured, typed representation is desirable. Otherwise, the vast majority of DANDI metadata can only be manipulated via raw, unstructured dicts like a JavaScript programmer.

You are right. If unfinalized metadata instances are to be made available, it would be desirable to make Schema B available as well. The down side is that doing so would make two JSON schemas, both A and B, public for each Dandi schema model. However, the upside is that the schemas correctly document the return data of the corresponding API endpoints.

My point really is that we can generate a version of JSON schema for a Pydantic model that allows incomplete data instances to facilitate users to provide metadata in multiple sessions. Use Pydantic model (also Schema A if want to be thorough) to fully validate the data instances when they are ready to be finalized. In this way, we always know that

  1. Finalized instances are always valid (both having the right value for each field and having all the required fields).
  2. Unfinalized instances may not have all the fields but have the right values for the fields that are provided, per JSON schema validation against Schema B.

Note: In the discussion I used "unfinalized" to denote metadata instances that are yet to be completed by the user. I suspected that the word "unpublished" metadata instances may have a slightly different meaning.

Lastly, it would be helpful if someone could provide a detailed answer to the first question in my previous post: why are we currently accepting unvalidated metadata to build Dandischema model objects? Furthermore, what are the desired behaviors we aim to achieve in the models, and which current behaviors are we willing to forgo?

@satra
Copy link
Member

satra commented Jan 24, 2025

@candleindark - perhaps use linkml mixins and create different models with increasing requirements. it may be helpful to have a clear understanding of the workflow of upload/update/GET requests and based on that decide where the models should be split or requirements changed.

@sneakers-the-rat
Copy link

taking a look at the validation code now, and it looks like another way you could get a) not duplicated models, b) optional values but still validated for type/value correctness if present, c) still have some required fields like id and schemaKey d) not allow null in the JSON Schema (which is what I think the main thing the json_schema validation does using this?) would be to do something like this, making a special type :

expand/collapse code example
from typing import Annotated, Any, TypeAlias, TypeVar

from pydantic import (
    BaseModel,
    ConfigDict,
    ValidationInfo,
    ValidatorFunctionWrapHandler,
    WrapValidator,
)


def _optional_if_draft(
    value: Any, handler: ValidatorFunctionWrapHandler, info: ValidationInfo
) -> Any:
    if (
        isinstance(info.context, dict)
        and info.context.get("status", None) == "draft"
        and value is None
    ):
        return value
    else:
        return handler(value)


T = TypeVar("T")
DraftOptional: TypeAlias = Annotated[T, WrapValidator(_optional_if_draft)]


class DandiModel(BaseModel):
    model_config = ConfigDict(validate_default=True)
    
    # still required
    id: str

    # required by default, but optional if draft mode
    # default avoids pre-emptive 'missing' check
    other_value: DraftOptional[int] = None


# validates normally if validating by instantiation
# this raises a validation error
_ = DandiModel(id="12345")

# optional fields are optional when in draft mode
# this works
_ = DandiModel.model_validate({"id": "12345"}, context={"status": "draft"})

# even in draft mode, invalid values are rejected
_ = DandiModel.model_validate(
    {"id": "12345", "other_value": 1.234}, context={"status": "draft"}
)

that might work in a pinch. ultimately it seems like it's more of an architectural question of dividing validation logic from input, but just volunteering that there are a decent number of different ways that this can be handled in pydantic without needing to juggle a bunch of json schema :)

@yarikoptic
Copy link
Member

@candleindark - perhaps use linkml mixins and create different models with increasing requirements.

such line of thinking is inline also with thinking of a solution for "vendoring" I brought up in #274 : having the base, non-vendored, model validating general format of IDs/URLs to be legit, whenever particular vendor mixing (e.g. dandi one) would add additional constraints (e.g. id starts with DANDI:, and URL is based on DJANGO_DANDI_WEB_APP_URL env var).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants