How can I use `SchemaView` to get the "effective" range of a slot? #2101

eecavanna · 2024-05-08T01:09:24Z

eecavanna
May 8, 2024

Hi, this is my first time posting a question here.

Background

I develop an application that interacts with a database that complies with a LinkML schema. That LinkML schema is: nmdc-schema.

I am currently writing a Python script that uses an instance of LinkML's SchemaView Python class to traverse that LinkML schema. I am particularly interested in identifying (and learning about) slots that can contain references to other slots. This is all in pursuit of implementing some referential integrity checks.

That nmdc-schema schema contains the definition of a class named Extraction. Based on a conversation with a teammate and my limited understanding of LinkML, I am under the impression that the has_input slot (in the context of the Extraction class, specifically) must consist of [a reference to] either a Biosample or a ProcessedSample, despite the default range of the has_input slot being NamedThing.

Question

How can I use a SchemaView instance to get that list: Biosample and ProcessedSample?

Here's a screenshot of a Python notebook where I've invoked a few SchemaView methods:

A) slot_definition.range (where slot_definition is returned by .induced_slot()) gives me the default range of the slot
B) slot_definition.any_of[0].range and [1].range give me the two class names in the any_of list
C) schema_view.slot_range_as_union( . . . ) gives me all three class names

Assuming my impression (stated in the "Background" section above) is correct, method (B) is the only one that gives me what I'm looking for. However, I only knew to look at the any_of list after a teammate pointed it out to me. Is there a more general way a SchemaView instance can be used to get a "fully resolved" list of the classes whose instances a slot can contain [references to] (instead of "manually" checking for the presence of a boolean expression; e.g. any_of, none_of)?

Footnote for future readers: I work with some of the LinkML developers and have made some assumptions about their familiarity with my situation.

sneakers-the-rat · 2024-05-11T05:21:25Z

sneakers-the-rat
May 11, 2024
Collaborator

This seems like a bug to me.

slot_usage is supposed to refine the usage of a slot in the context of a given class - to extend or override the default slot definition (tho @cmungall has said in #1962 that linkml should be monotonic).

so i've been wanting to take a crack at rewriting the induced_slot method for a while because it's a pretty critical one (among the others that handle inheritance and overrides) for my purposes, and I think I can see where this is coming from. I would expect induced_slot to work in an "ancestor-wise" way - for each of the ancestor definitions/redefinitions of a slot, resolve each of those from the tips of the tree down to the slot in question. Instead it works in a "metaslot-wise" way (https://github.com/linkml/linkml-runtime/blob/c7815cb81c539f919e2ec48d2ca38d06da5aeb18/linkml_runtime/utils/schemaview.py#L1352) where for each slot we iterate through the metaslots and then (https://github.com/linkml/linkml-runtime/blob/c7815cb81c539f919e2ec48d2ca38d06da5aeb18/linkml_runtime/utils/schemaview.py#L1360) do an ancestor-wise pass. This also explains a bit why induced_slot is one of the methods that by itself takes the most time in the library.

It makes sense why the any_of construction is the way it is (like JSON Schema), but I think that's a syntactic thing that should def be smoothed out in the interface. The any_of needs to be computed along with the range for each ancestor layer, but currently it isn't, so a SlotDefinition that has both range = 'NamedThing' and the any_of from the slot_usage is incorrect.

So when resolving/inducing a slot, we should always make range behave like slot_range_as_union (except without the erroneously included NamedThing).
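A minimal sketch of that behavior, written as a standalone helper rather than as part of SchemaView. The name `effective_ranges` is hypothetical, and the rule it applies (ranges found in `any_of` / `exactly_one_of` replace the scalar `range` instead of being added to it) is an assumption based on the discussion above, not confirmed library behavior. It is duck-typed, so it should accept anything with `.range`, `.any_of` and `.exactly_one_of` attributes, such as the object returned by `induced_slot`; the `SimpleNamespace` objects below are stand-ins so the example runs without a schema file.

```python
from types import SimpleNamespace

# Expressions whose members can be read as alternatives for the range.
# all_of / none_of have different semantics and are left out of this sketch.
UNION_FIELDS = ("any_of", "exactly_one_of")


def effective_ranges(slot):
    """Return ranges named in any_of/exactly_one_of, else the scalar range.

    Hypothetical helper: assumes the expression ranges override the
    default range rather than extend it.
    """
    found = []
    for field in UNION_FIELDS:
        for expr in getattr(slot, field, None) or []:
            rng = getattr(expr, "range", None)
            if rng is not None and rng not in found:
                found.append(rng)
    if found:
        return found
    return [slot.range] if slot.range is not None else []


# Stand-in for the induced has_input slot of Extraction described in the OP.
has_input = SimpleNamespace(
    range="NamedThing",
    any_of=[SimpleNamespace(range="Biosample"), SimpleNamespace(range="ProcessedSample")],
    exactly_one_of=[],
)
print(effective_ranges(has_input))  # ['Biosample', 'ProcessedSample']
```

With a real schema, the same function could presumably be called as `effective_ranges(sv.induced_slot('has_input', 'Extraction'))`.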

Here's another illustrative example that imo is buggy:

id: test_schema
name: test
imports:
  - linkml:types

classes:
  MyClass:
    attributes:
      my_slot:
        any_of:
          - range: string
          - range: integer

>>> from linkml_runtime.utils.schemaview import SchemaView
>>> sv = SchemaView('/Users/jonny/Desktop/test_schema.yaml')
>>> slot = sv.induced_slot('my_slot', 'MyClass')
>>> type(slot.range)
<class 'NoneType'>

This is getting a bit at the difference between the metamodel python classes, which are just representations of the schema yaml objects, and what is produced by SchemaView. The purpose (i think?) of SchemaView is to provide an interface to the schema s.t. the slots and classes "behave like they should" based on the metamodel in ways that need to be materialized above the literal representation in the yaml. SchemaView not behaving this way makes for a lot of awkward moments in the generators, eg. in PydanticGenerator.range_class_has_identifier_slot (which should be in SchemaView), instead of checking all the items in range we need to do a separate iteration over any_of (https://github.com/linkml/linkml/blob/b5313f98ddb508e414e0ec62a75b2744f64fa59d/linkml/generators/pydanticgen/pydanticgen.py#L350-L356). Note that we are only checking any_of there rather than also handling exactly_one_of or all_of, which can also have ranges, so that's incomplete behavior. Lots of other examples across the generators.

Distressingly this causes the model to be generated incorrectly by pydanticgen:

class Extraction(PlannedProcess):
    """
    A material separation in which a desired component of an input material is separated from the remainder.
    """
    # ... other stuff ...
    has_input: List[str] = Field(default_factory=list, description="""An input to a process.""")

which is none of the three options in the OP. That's because while it correctly gets the any_of range, the range classes Biosample and ProcessedSample have an identifier slot id and aren't marked to be inlined.

pythongen also gets it incorrect, but in a different way:

@dataclass
class Extraction(PlannedProcess):
    """
    A material separation in which a desired component of an input material is separated from the remainder.
    """
    has_input: Union[Union[str, NamedThingId], List[Union[str, NamedThingId]]] = None

where it misses the any_of range in the slot usage.

So bc this isn't resolved well by SchemaView, there are divergent errors in the generators. Things downstream from generators should never have to worry about inheritance, ancestry, etc - by the time the schema reaches them it should already be "cooked" (using the parlance used in a few places in the library).

The simplest fix here would just be to clear range or any_of when one or the other is defined in an ancestor class in induced_slot, but the longer term fix will be to make SchemaView recursive (see #1739 and #1839), both within a schema and across its imports, where each class and slot in the dependency tree is fully resolved once and each inheriting/extending class/slot/etc. resolves itself relative to its parents. In this case, then, when materializing the has_input slot, we would just be looking at the definition of has_input in PlannedProcess, and applying a single apply_slot_usage method that knows how to both overwrite scalar rules as well as resolve any_of, exactly_one_of etc. between a parent and child class, and the problem is resolved for all downstream use.
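To make the "simplest fix" concrete, here is a rough sketch over plain dicts. Nothing in it is linkml-runtime code: `apply_slot_usage` and `induce` are hypothetical functions, the dict layout only loosely mirrors schema YAML, and the slot_usage shown for Extraction is an assumption based on the OP's description. The rule it illustrates is that when a child layer restates the range in any form, every inherited range-like key is cleared first, so a stale `range` never sits next to a new `any_of`.

```python
RANGE_KEYS = ("range", "any_of", "exactly_one_of")


def apply_slot_usage(parent, usage):
    """Overlay one slot_usage layer on an already-resolved parent slot."""
    resolved = dict(parent)
    if any(key in usage for key in RANGE_KEYS):
        # The child restates the range, so drop all inherited range-like keys.
        for key in RANGE_KEYS:
            resolved.pop(key, None)
    # Remaining keys behave as simple scalar overrides.
    resolved.update(usage)
    return resolved


def induce(layers):
    """Resolve ancestor-wise: base definition first, most specific usage last."""
    resolved = {}
    for layer in layers:
        resolved = apply_slot_usage(resolved, layer)
    return resolved


base = {"description": "An input to a process.", "range": "NamedThing"}
extraction_usage = {"any_of": [{"range": "Biosample"}, {"range": "ProcessedSample"}]}
print(induce([base, extraction_usage]))
```

This sketch does not check that the child's ranges are actually subclasses of the parent's range, which a monotonic design would also need.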

0 replies

cmungall · 2024-05-13T15:01:01Z

cmungall
May 13, 2024
Maintainer

This isn't a bug though... `range` is single-valued, so None/Any is the most specific single valued range that can be returned. But the behavior is incomplete. We want slot induction to populate `range_expression` with the non-redundant entailed expression. There is code in the main repo https://github.com/linkml/linkml/blob/main/linkml/transformers/logical_model_transformer.py that could be used here.


3 replies

sneakers-the-rat May 13, 2024
Collaborator

I'm saying that it isn't technically a bug because in the metamodel they are different fields (range vs any_of: [range: ...]) but it results in buggy behavior.

The desired behavior is:

given class inheritance tree-

classes:
  Parent:
  Child1:
    is_a: Parent
  Child2:
    is_a: Parent

base slot has range Parent
induced slot has range Union[Child1, Child2]

because the syntax is different to express what is intuitively the "same thing":

range: Parent
# vs
any_of:
  - range: Child1
  - range: Child2

then one gets unexpected outcomes because SchemaView doesn't treat them the same.
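A small sketch of the desired behavior for this Parent/Child1/Child2 example, using a plain dict for the is_a relationships. The function names are hypothetical and the subclass check is an assumption about what "monotonic" refinement would require; it is not taken from SchemaView.

```python
def ancestors(name, is_a):
    """Yield `name` and each of its is_a ancestors, nearest first."""
    while name is not None:
        yield name
        name = is_a.get(name)


def induced_range(base_range, any_of, is_a):
    """Return the any_of members if each one refines base_range, else raise."""
    if not any_of:
        return [base_range]
    for member in any_of:
        if base_range not in ancestors(member, is_a):
            raise ValueError(f"{member} is not a {base_range}")
    return list(any_of)


is_a = {"Parent": None, "Child1": "Parent", "Child2": "Parent"}
print(induced_range("Parent", ["Child1", "Child2"], is_a))  # ['Child1', 'Child2']
```

An any_of member that is not a descendant of the base range raises a ValueError here, which is one way to surface a non-monotonic slot_usage early.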

I think adding another method here would make things more complicated, because we would basically always want to use that in generators, right? There is (or should be) only one correct range (whether that be an expression or a scalar value) for a given slot in a given class, so there should be one way to get that. There are already two methods (induced_slot and slot_range_as_union) and the way they are used differently in different generators is causing bugs and inconsistent behavior.

Like i was saying above i think this points to a larger sense of uncertainty about what schemaview does and what the classes/slots/etc. returned by its methods represent. Currently the pythongen metamodel classes are essentially validated in-memory versions of schemas, but there is a lot of logic that needs to happen to that to materialize them. Ideally there would be a clear chain like

schema --{schemaview}-> materialized schema --{generators}-> generated schema

where the role of schemaview is to be the interface to a schema, and you could be confident that everything you get out of schemaview is the 'fully cooked' version of the schema. So eg. a range within a slot within a class should be the same whether one accesses it with

SchemaView.get_class('ClassName').slots['slot_name'].range or
SchemaView.induced_slot('slot_name', 'ClassName').range

and one wouldn't need a separate method to compute the range_expression as differentiated from scalar range. To get there I think we need to make schemaview recursive so that all those computations are done exactly once for each class/slot/etc. combination instead of spread out across many methods.

cmungall May 13, 2024
Maintainer

Agreed that pythongen confuses things: it predates SchemaView and used SchemaLoader, which essentially materialized inferences.

SchemaView intends to be very clear about asserted vs entailed. In the following:

SchemaView.get_class('ClassName').slots['slot_name'].range
SchemaView.induced_slot('slot_name', 'ClassName').range

The first operates over asserted structures (in fact this example isn't quite right since slots is a flat list, but if you replaced slots with slot_usage or attributes the idea is the same). In general, anything that doesn't have induced in the name operates over the asserted structure.

The second operates over the induced/entailed schema. Roughly, "rolling down" the is-a hierarchy etc.

Just to be clear, range_expression wouldn't be a new method, this is a metamodel slot: https://w3id.org/linkml/range_expression.

Currently SchemaView is incomplete, because it doesn't populate range_expression.

sneakers-the-rat May 13, 2024
Collaborator

I think I get the idea. I guess another way of saying what i'm saying is that the entailment getters are not clearly enough delineated from the asserted getters. Ie. I would expect a simple schema loader to just return the asserted, literal models, but I would expect SchemaView to be all entailed. To me it is pretty non-intuitive to need to get_class, loop through its attrs and slots with induced_slot, and now potentially call a third method to get the resolved range of that slot to get a fully materialized class - I would think that get_class should return the class with all its entailments computed for me, which would make using schemas way easier imo.

I get that's a decent amount of work, i'm trying to make the case that we should have that as a goal :)
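As a rough illustration of the loop described above, the sketch below walks the induced slots of a class and resolves each one's range in a single pass. `materialize_class` and `union_of` are hypothetical names, not SchemaView methods. The only SchemaView call it relies on is `class_induced_slots(class_name)`; `StubView` and the slot values inside it are made-up stand-ins so the example runs without a schema file.

```python
from types import SimpleNamespace


def union_of(slot):
    """Ranges from any_of / exactly_one_of, falling back to the scalar range."""
    exprs = list(slot.any_of or []) + list(slot.exactly_one_of or [])
    ranges = [e.range for e in exprs if e.range]
    return ranges or ([slot.range] if slot.range else [])


def materialize_class(view, class_name):
    """Map each slot name of a class to its resolved list of ranges."""
    return {s.name: union_of(s) for s in view.class_induced_slots(class_name)}


def _expr(range_name):
    """Build a stand-in for an anonymous slot expression."""
    return SimpleNamespace(range=range_name)


class StubView:
    """Hypothetical stand-in for SchemaView, for demonstration only."""

    def class_induced_slots(self, class_name):
        return [
            SimpleNamespace(name="id", range="string", any_of=[], exactly_one_of=[]),
            SimpleNamespace(
                name="has_input",
                range="NamedThing",
                any_of=[_expr("Biosample"), _expr("ProcessedSample")],
                exactly_one_of=[],
            ),
        ]


print(materialize_class(StubView(), "Extraction"))
```

Passing a real SchemaView instance in place of `StubView()` should work the same way, under the assumption that the expression ranges are meant to replace the default range.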

sierra-moxon · 2024-05-13T15:19:46Z

sierra-moxon
May 13, 2024
Maintainer

noting that we made an issue out of this discussion last week: #2103

0 replies

cmungall · 2024-05-13T19:22:35Z

cmungall
May 13, 2024
Maintainer

I think adding an option to get_class that materializes the entailments is a great idea!


0 replies
