Overhaul models so that "unfinished" metadata can be represented without cheating Pydantic #205
Comments
I thought to argue that it would add another combinatorial dimension, but well, we kinda already have it halfway in terms of the separation of "Common" vs "Publishable".
The question then is really how to "shift" validation into something like …
Butting in here with uninvited 2 cents: to avoid needing to maintain a whole parallel set of models, you could use a field validator that indicates when validation fails by modifying a field, like

```python
# (inside a dandischema model class)
@field_validator('*', mode='wrap')
@classmethod
def allow_optional[T](
    cls,
    v: T,
    handler: ValidatorFunctionWrapHandler,
    info: ValidationInfo
) -> T:
    try:
        return handler(v)
    except ValidationError as e:
        # do something to indicate that we failed validation but still allow class instantiation, eg.
        info.data['validation_errors'].append(e)
        return v

@model_validator(mode='after')
def validation_state(self):
    # do something here to check if we had validation errors that forbid us from being publishable
    ...
```

You could also dynamically control whether one should be validating publishable or draft standards using validator context, e.g.: https://docs.pydantic.dev/latest/concepts/validators/#using-validation-context-with-basemodel-initialization
@yarikoptic I don't think I'm the right person to assign this to (at least not at this time). The task requires someone (or discussions with several someones) who knows what the "minimal metadata" requirements are, what the "completed metadata" requirements are, and what's allowed for the in-between.
gotcha -- thanks. Let us dwell on this aspect -- maybe during the upcoming meetup with @satra et al.
I have been thinking about this but am still not sure if we would ever be able to easily guarantee that even a published Dandiset's metadata conforms to "The Model". The reason is the same one that haunts NWB (and PyNWB in particular ATM): model versions and the fact that we can and do break backward compatibility (e.g. #235 would make possibly legit prior models now invalid). So, unless we can come up with a way to have a model "Valid" according to a specific version (which we cannot at the pydantic level since we have only 1 "current" model version), we cannot guarantee that past "Valid" models remain valid currently.
Seems like we would need model migration here, and that might be a good thing to have generally, no schema stays still ;)

I interject again here bc i have been wanting to do something similar and think it might be a nice little pydantic extension - a decorator/module-level const gives a model a particular version, and each upgrade would need to provide a migration patch that can do the upgrade. I think that might be nicer than maintaining full copies of every version without clean ways to migrate between them. pydantic should allow model instantiation through validation errors in any case, and the migration methods would get us 'best effort' upgrades (eg. be able to perform renames and etc. but not magically fill in missing required values).

If the plan is to move to linkml eventually, this would be something i would be happy to work on with y'all in the pydantic generator; i have been doing big refactor work there and am in the process of finishing a patch to include all schema metadata, so if we want to make some helper code to do diffs between schema versions and use that to generate model migration code i would be super into working on that.

edit: fwiw i have this mostly done for NWB in linkml by being able to pull versions of the schema and generate models on the fly - eg see the git provider and the schema provider, where one just does …
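To make that concrete, here is a rough sketch of the kind of migration registry being described. Everything in it is illustrative — the decorator, the version strings, and the `contactPoint` rename are made up, not an existing dandischema or pydantic API:

```python
from typing import Any, Callable

# registry of best-effort migration patches between schema versions
Migration = Callable[[dict[str, Any]], dict[str, Any]]
MIGRATIONS: dict[tuple[str, str], Migration] = {}


def migration(from_version: str, to_version: str) -> Callable[[Migration], Migration]:
    """Register a patch that upgrades raw metadata from one schema version to the next."""
    def register(fn: Migration) -> Migration:
        MIGRATIONS[(from_version, to_version)] = fn
        return fn
    return register


@migration("0.6.0", "0.6.1")
def rename_contact_field(data: dict[str, Any]) -> dict[str, Any]:
    # a simple rename; note that missing required values cannot be invented here
    if "contactPoint" in data:
        data["contact"] = data.pop("contactPoint")
    return data


def upgrade(data: dict[str, Any], path: list[str]) -> dict[str, Any]:
    """Apply the chain of registered patches along a version path."""
    for frm, to in zip(path, path[1:]):
        data = MIGRATIONS[(frm, to)](data)
    return data


# best-effort upgrade of an old payload before validating with the current model
upgraded = upgrade({"contactPoint": "me@example.org"}, ["0.6.0", "0.6.1"])
```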
thanks @sneakers-the-rat - i think this may align well with the transition to linkml. i know @candleindark is working on an updated reverse linkml generator from the current pydantic models in dandischema. i suspect this will be a priority for us to redo the models using linkml in the upcoming month. we will also want to separate out schema validation into two components: 1) are appropriate values being assigned and 2) are required values being assigned. requirement is a project specific thing and hence this will also allow us to reuse schemas. also 2 allows us to further stratify requirement given the state of an asset (pre-upload, uploaded, modified, published) or dandiset. we need a specific project timeline/design for this transition.
Lmk how I can help - happy to hack on the pydantic generator to make it work to fit yalls needs bc working with DANDI + NWB in linkml is exactly within my short term plans :)
@sneakers-the-rat My current plan is to improve the Pydantic-to-LinkML generator as I participate in the transition to LinkML in dandischema. One approach is to build something that mimics the behavior of …
That would be great! There is already a sort of odd LinkMLGenerator, but generalizing that out to accept pydantic models / JSON schema from them would be a very useful thing for initial model import.

LinkML definitely wants to work with the linkml schema being the primary source of truth, aka do the pydantic model -> linkml schema conversion and from then on use the pydantic models generated from the linkml schema, but I have been working on the generator to make it easier to customize for specific needs, eg. if yall want to separate some validation logic in a special way that the default generator doesnt do. I overhauled the templating system recently to that end, see: https://linkml.io/linkml/generators/pydantic.html#templates

And im also working on making pydantic models fully invertible to linkml schema, so you could go DANDI models -> linkml schema -> linkml pydantic models -> customized linkml DANDI models and then be able to generate the schema in reverse from those, but it might be more cumbersome to maintain than just customizing the templates and having the schema be the source of truth. See: linkml/linkml#2036

That way you can also do stuff that cant be supported in pydantic like all the RDF stuff (but im working on that next) ;)
(Maybe we should make a separate issue for linkml discussions, sorry to derail this one)
@sneakers-the-rat Thanks for the input. I think this is a solution for making fields optional when receiving user inputs while keeping the fields required at publication. However, one can't generate, at least not directly, two schema variants (one with the fields optional and the other with the fields required). It would be nice if we could generate the two different schema variants. Is …
Do you mean at a field level, being able to label a given slot as "toggle required" so at generation time you get two pydantic modules, one with those as required and one as optional? Or do you mean at a generator level, making two pydantic modules, one where all slots are required and one where all are optional?

Im assuming the former, where you want to make a subset of slots that are annotated as being part of a …

Another approach, if ya still want to make multiple sets of models, might be to do something like have one base model thats the most lax, then have a …
Sorry, to answer your question: I am betting that we could rig something up to generate different models on a switch. That would probably be easiest to do by making a step before schemaview where you load the schema, introspect on it to flip requiredness depending on a flag, and then send that to schemaview. SV is sorta badly in need of a clarifying refactor bc at the moment its a bit of a "there be dragons" class (in my loving opinion, which is based on appreciating the winding road that led to SV).
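A rough sketch of that pre-generation step, assuming the schema is simply loaded as plain YAML and slots are relaxed based on a flag before being handed to the generator (the file name and the always-required slot names are placeholders):

```python
from typing import Any

import yaml

# placeholder: slots that stay required even in the draft variant
ALWAYS_REQUIRED = {"id", "schemaVersion"}


def relax_schema(path: str, draft: bool) -> dict[str, Any]:
    """Load a LinkML schema and optionally flip requiredness for the draft variant."""
    with open(path) as f:
        schema: dict[str, Any] = yaml.safe_load(f)
    if draft:
        for name, slot in schema.get("slots", {}).items():
            if name not in ALWAYS_REQUIRED:
                # draft variant: everything outside the allow-list becomes optional
                slot["required"] = False
    return schema


# generate two variants from the same source of truth
draft_schema = relax_schema("dandiset.yaml", draft=True)
publishable_schema = relax_schema("dandiset.yaml", draft=False)
```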
I am not clear about this, but that could be because I have never tried generating two modules from one schema.
Yes, this is close to what I had in mind. Have one LinkML schema, the base schema, as the source of truth, and generate variants from it by toggling the requiredness of slots.
for that, probably the easiest way would be to use subsets, like

```yaml
subsets:
  DraftSubset:
    rank: 0
    title: Draft Subset
    description: Set of slots that must be present for draft datasets
  PublishableSubset:
    rank: 1
    title: Publishable Subset
    description: Set of slots that must be present for publishable datasets

slots:
  optional_slot:
    required: false
    description: Only required for publishable datasets
    in_subset:
      - PublishableSubset
  required_slot:
    required: true
    description: Required on all datasets
    in_subset:
      - DraftSubset
      - PublishableSubset
```

and then you could either iter through slots and check the subsets they are in, or use the …

I still think that you can do this with one set of models! I am just imagining it being difficult to parse and also to write code for "ok now i import the …" We would just need to improve the pydantic generator to support … So then you would do something like

```yaml
classes:
  Dataset:
    attributes:
      attr1: "..."
      attr2: "..."
      publishable:
        equals_expression: "{attr1} and {attr2}"
```

which would make a model like

```python
class Dataset(BaseModel):
    attr1: str
    attr2: str

    @computed_field
    @property
    def publishable(self) -> bool:
        return (self.attr1 is not None) and (self.attr2 is not None)
```

or we could combine the approaches and make an extension to the metamodel so we get the best of both - clear metadata on the slots and also a single model:

```yaml
# ...
equals_expression: "all(x, PublishableSubset, x)"
```

where the first … I think this is probably a common enough need that it would be worth making a way to express this neatly in linkml, so we can probably make the metamodel/generators come to you as well as y'all hacking around the metamodel/generators :)

So to keep a running summary of possible implementations: …
I want to know more about why we are currently accepting unvalidated metadata to build dandischema model objects. Is it, as @jwodder mentioned in the opening post, to allow web users to fill in metadata over multiple sessions?

If the answer to the second question is yes, there may be a simple solution. For each Pydantic model, we can generate two JSON schemas: one, Schema A, that is consistent with the Pydantic model (this one for public reference), and the other, Schema B, with all the properties of the model changed to optional (for internal use only). We can use Schema B to build the web UI for user input that allows submitting incomplete metadata. At the server, before it is requested to be finalized, the submitted metadata is not validated against any Pydantic model; it is treated as a dictionary that is validated against Schema B and stored as a JSON object. A user can request the submitted metadata to be finalized using the web UI or the client. When the server receives such a request, it uses the corresponding Pydantic model to validate the submitted metadata, i.e. to finalize it.

Using the above scheme has the following benefits: …
I don't know how this solution can address the concern brought up by @yarikoptic in #205 (comment). However, I think that can be addressed separately.
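As a minimal sketch of how Schema B could be derived mechanically from the existing models (the stand-in model and the always-required allow-list below are illustrative), one could strip the `required` lists from the JSON schema that Pydantic already produces:

```python
from typing import Any

from pydantic import BaseModel


class Dandiset(BaseModel):
    # stand-in for the real dandischema model
    identifier: str
    name: str
    description: str


ALWAYS_REQUIRED = {"identifier"}  # illustrative allow-list


def make_schema_b(model: type[BaseModel]) -> dict[str, Any]:
    """Schema A with every property made optional except an allow-list."""
    schema = model.model_json_schema()
    # handle the top-level object and any nested definitions
    for definition in [schema, *schema.get("$defs", {}).values()]:
        required = [p for p in definition.get("required", []) if p in ALWAYS_REQUIRED]
        if required:
            definition["required"] = required
        else:
            definition.pop("required", None)
    return schema


schema_a = Dandiset.model_json_schema()  # public, matches the Pydantic model
schema_b = make_schema_b(Dandiset)       # internal, accepts incomplete drafts
```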
Is making everything optional the correct approach? What about foundational fields like …?
I disagree on this point. The API can still return "schema B metadata" for objects in unpublished versions, and providing a structured, typed representation is desirable. Otherwise, the vast majority of DANDI metadata can only be manipulated via raw, unstructured dicts like a JavaScript programmer.
If there are fields that are required even for unfinalized (unfinished) metadata instances, making every field optional in Schema B is not the correct approach. However, we can selectively keep those fields required in the generation of Schema B. Going the other way, I think there are also some string fields on which we enforce required formats; do we want those relaxed for Schema B? If the format requirements are enforced at the JSON schema level (i.e. specified in Schema A), they should be available in Schema B as well. The only differences between Schema A and B are in which fields are required.
You are right. If unfinalized metadata instances are to be made available, it would be desirable to make Schema B available as well. The downside is that doing so would make two JSON schemas, both A and B, public for each Dandi schema model. However, the upside is that the schemas correctly document the return data of the corresponding API endpoints.

My point really is that we can generate a version of the JSON schema for a Pydantic model that allows incomplete data instances, so that users can provide metadata over multiple sessions, and then use the Pydantic model (and Schema A, if we want to be thorough) to fully validate the data instances when they are ready to be finalized. In this way, we always know that …
Note: in the discussion I used "unfinalized" to denote metadata instances that are yet to be completed by the user. I suspect that "unpublished" metadata instances may have a slightly different meaning.

Lastly, it would be helpful if someone could provide a detailed answer to the first question in my previous post: why are we currently accepting unvalidated metadata to build dandischema model objects? Furthermore, what are the desired behaviors we aim to achieve in the models, and which current behaviors are we willing to forgo?
@candleindark - perhaps use linkml mixins and create different models with increasing requirements. it may be helpful to have a clear understanding of the workflow of upload/update/GET requests and based on that decide where the models should be split or requirements changed.
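As a rough pydantic analogue of that idea (in LinkML the mixins would live in the schema itself; the class and field names here are illustrative), a lax draft model could be tightened by a derived model that re-declares the slots publication requires:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class DraftDandiset(BaseModel):
    # lax: only the bare minimum is required at upload/update time
    identifier: str
    name: Optional[str] = None
    description: Optional[str] = None


class PublishableDandiset(DraftDandiset):
    # tightened: re-declare the slots that publication requires
    name: str
    description: str


# a draft with only an identifier is acceptable...
DraftDandiset.model_validate({"identifier": "000001"})

# ...but the same payload fails the publishable model
try:
    PublishableDandiset.model_validate({"identifier": "000001"})
except ValidationError as e:
    print(e)  # name and description are missing for publication
```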
taking a look at the validation code now, and it looks like another way you could get a) not duplicated models, b) optional values but still validated for type/value correctness if present, c) still have some required fields, like:

```python
from typing import Annotated, Any, TypeAlias, TypeVar

from pydantic import (
    BaseModel,
    ConfigDict,
    ValidationInfo,
    ValidatorFunctionWrapHandler,
    WrapValidator,
)


def _optional_if_draft(
    value: Any, handler: ValidatorFunctionWrapHandler, info: ValidationInfo
) -> Any:
    if (
        isinstance(info.context, dict)
        and info.context.get("status", None) == "draft"
        and value is None
    ):
        return value
    else:
        return handler(value)


T = TypeVar("T")
DraftOptional: TypeAlias = Annotated[T, WrapValidator(_optional_if_draft)]


class DandiModel(BaseModel):
    model_config = ConfigDict(validate_default=True)

    # still required
    id: str

    # required by default, but optional if draft mode
    # default avoids pre-emptive 'missing' check
    other_value: DraftOptional[int] = None


# validates normally if validating by instantiation
# this raises a validation error
_ = DandiModel(id="12345")

# optional fields are optional when in draft mode
# this works
_ = DandiModel.model_validate({"id": "12345"}, context={"status": "draft"})

# even in draft mode, invalid values are rejected
_ = DandiModel.model_validate(
    {"id": "12345", "other_value": 1.234}, context={"status": "draft"}
)
```

that might work in a pinch. ultimately it seems like it's more of an architectural question of dividing validation logic from input, but just volunteering that there are a decent number of different ways that this can be handled in pydantic without needing to juggle a bunch of json schema :)
such a line of thinking is also in line with thinking of a solution for the "vendoring" I brought up in #274: having the base, non-vendored model validate the general format of IDs/URLs to be legit, whereas a particular vendor mixin (e.g. the dandi one) would add additional constraints (e.g. …).
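A rough sketch of that layering, assuming a generic base model plus a DANDI-specific mixin that tightens the identifier check (the patterns and class names are illustrative, not the actual #274 design):

```python
import re

from pydantic import BaseModel, field_validator


class BaseResource(BaseModel):
    # base, non-vendored model: only checks that the identifier is URI-ish
    identifier: str

    @field_validator("identifier")
    @classmethod
    def check_generic_format(cls, v: str) -> str:
        if not re.match(r"^[a-z][a-z0-9+.-]*:", v):
            raise ValueError("identifier must look like a URI")
        return v


class DandiResourceMixin(BaseModel):
    # vendor mixin: additionally require the dandi: prefix (illustrative constraint)
    @field_validator("identifier", check_fields=False)
    @classmethod
    def check_dandi_prefix(cls, v: str) -> str:
        if not v.startswith("dandi:"):
            raise ValueError("identifier must use the dandi: prefix")
        return v


class DandiResource(DandiResourceMixin, BaseResource):
    """Vendored model: generic format check plus the DANDI-specific one."""


DandiResource(identifier="dandi:000001")     # passes both validators
BaseResource(identifier="doi:10.1234/abcd")  # passes the generic check only
```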
Note: @satra filed a spiritually-identical issue at almost the same time: #204
(This is an accumulation of various things discussed elsewhere which need to be written down.)
Currently, it is possible for users of the Dandi Archive API to submit asset & Dandiset metadata that does not fully conform to the relevant dandischema model, and the Archive will accept, store, and return such flawed metadata, largely via use of `DandiBaseModel.unvalidated` (being replaced by Pydantic's `construct` in #203). I believe part of the motivation for this is so that web users can fill in metadata over multiple sessions without having to fill in every field in a single sitting.

This results in API requests for asset & Dandiset metadata sometimes returning values that do not validate under dandischema's models; in particular, if a user of dandi-cli's Python API calls `get_metadata()` instead of `get_raw_metadata()`, the call may fail because our API returned metadata that doesn't conform to our own models (see dandi/dandi-cli#1205 and dandi/dandi-cli#1363).
The dandischema models should therefore be overhauled as follows:
- There should exist models for representing Dandiset & asset metadata in a draft/unfinished state. These models should accept all inputs that we want to accept from users (both via the API and the web UI), store in the database, and return in API responses. (It is likely that such models will have all of their fields marked optional aside from the absolute bare minimum required.)
- The `get_metadata()` methods of dandi-cli's Python API should return instances of these models.
- There should exist functionality for determining whether an instance of a draft/unfinished model meets all of the requirements for Dandiset publication (see the sketch after this list).
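As a rough sketch only (all names are illustrative, not a proposed API), the draft model plus a publication-readiness check could look something like:

```python
from typing import Optional

from pydantic import BaseModel


class DraftDandisetMeta(BaseModel):
    # accepts everything we are willing to store and return for drafts
    identifier: str                      # bare minimum, always required
    name: Optional[str] = None
    description: Optional[str] = None
    license: Optional[list[str]] = None


PUBLICATION_REQUIRED = ("name", "description", "license")  # illustrative


def missing_for_publication(meta: DraftDandisetMeta) -> list[str]:
    """Fields that must still be filled in before the Dandiset can be published."""
    return [f for f in PUBLICATION_REQUIRED if getattr(meta, f) in (None, [], "")]


draft = DraftDandisetMeta(identifier="000001", name="My dataset")
print(missing_for_publication(draft))  # ['description', 'license']
```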
CC @satra @dandi/dandiarchive @dandi/dandi-cli