Replies: 5 comments 5 replies
-
I don't have a strong opinion. I favor being explicit, but I recognize that requiring type designators seems inelegant for simple type-inference use cases. Re: Pydantic: Pydantic uses its own version of a type designator (a discriminator field) but can work without one. Without a type designator, Pydantic will naively parse the object as the specified base class, or as the first matching class out of a union of options. You can change that behavior with custom, fairly complex inference code. Here is a nice write-up: https://blog.devgenius.io/deserialize-child-classes-with-pydantic-that-gonna-work-784230e1cf83 . However, I think that even with this inference code, there are edge cases where disambiguation is simply not possible. For example, if neither …
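For concreteness, here is a minimal sketch of what Pydantic's own "type designator" mechanism looks like, using Pydantic v2 discriminated unions. The class and slot names (`Person`, `Organization`, `person_id`, `organization_id`) are borrowed from the examples elsewhere in this thread; this is not LinkML-generated code, just an illustration of the pattern.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter

class Person(BaseModel):
    # the Literal-typed "type" field acts as the explicit type designator
    type: Literal["Person"] = "Person"
    person_id: str

class Organization(BaseModel):
    type: Literal["Organization"] = "Organization"
    organization_id: str

# Discriminated ("tagged") union: Pydantic dispatches on the "type" key
# instead of trying each union member in declaration order.
Entity = Annotated[Union[Person, Organization], Field(discriminator="type")]

entity = TypeAdapter(Entity).validate_python({"type": "Person", "person_id": "P1"})
assert isinstance(entity, Person)
```

Without the `discriminator`, Pydantic falls back to the first-match behavior described above, which is where the ambiguous cases arise.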
-
Somewhat related: how would you add type designators to your example? I can't seem to get it to work... Does it work only when not inlined as a list?

Schema:

```yaml
id: https://w3id.org/linkml/compliance/type-inference
name: type_inference
prefixes:
  linkml: https://w3id.org/linkml/
  compliance: https://w3id.org/linkml/compliance/
default_range: string
imports:
  - linkml:types
classes:
  Container:
    attributes:
      entities:
        range: Entity
        multivalued: true
        inlined_as_list: true  # <-- I added this
  Entity:
    attributes:
      name:
      type:  # <-- I added this
        designates_type: true  # <-- I added this
  Organization:
    is_a: Entity
    attributes:
      organization_id:
  Person:
    is_a: Entity
    attributes:
      person_id:
```

Data:

```yaml
entities:
  - person_id: P1
    type: Person  # <-- I added this
  - organization_id: P1
    type: Organization  # <-- I added this
```

Output:
-
Let me try to describe our use case (since I'm one of the motivators for this idea). We are defining a schema to model devices. The goal of such a schema is to provide the vocabulary that our software will understand and support. Support for additional devices can be added through extensions. Those extensions can have their own vocabulary derived (…). Our software only cares about our classes and slots and those derived (…).

Let me try to illustrate it with an example (derived from the examples you use in your tests). Given a base schema:

And a derived schema:

And these data:

Our software will know about … Hopefully this makes it clearer what I'm trying to accomplish, so you can better understand my use cases.
-
I would expect the validator not to complain with the newly added flag `--include-range-class-descendants`, contributed by me and merged just yesterday (no release includes it yet). It works like the same flag for `gen-json-schema`.
________________________________
From: Ryan Ly
Sent: Friday, July 21, 2023 10:18:18 PM
To: linkml/linkml
Subject: Re: [linkml/linkml] Allow for type inference even in the absence of type generators (Discussion #1548)
+1 for arbitrary user-defined schema extensions. That would be necessary for use in the NWB project.
Regarding type inference in this case, note that Pydantic currently supports parsing the list of event data as instances of the Event class, and not their original subclass (e.g., ExaminationEvent). However, Pydantic does not support dumping of the list of event data to contain the extra fields added by subclasses (without a fair amount of additional code). Example:
```python
from pydantic import BaseModel

class BasePet(BaseModel):
    legs: int

class Cat(BasePet):
    meows: float

class Dog(BasePet):
    barks: float

class Container(BaseModel):
    pets: list[BasePet]

container = Container(pets=[Cat(legs=4, meows=2.718), Dog(legs=3, barks=3.14)])

# dumping the model results in loss of fields from subclasses;
# parsing this JSON will result in BasePet objects
print(container.model_dump_json())

# extra fields are OK and ignored during validation;
# parsing this JSON will result in BasePet objects
print(Container.model_validate_json('{"pets": [{"legs": 4, "meows": 1}, {"legs": 3, "barks": 2}]}'))
```

Output:

```
{"pets":[{"legs":4},{"legs":3}]}
pets=[BasePet(legs=4), BasePet(legs=3)]
```
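As an aside, the "fair amount of additional code" for dumping can be reduced in Pydantic v2 with `SerializeAsAny`, which switches that field to duck-typed serialization so subclass fields survive the dump. A minimal sketch, reusing the `BasePet`/`Cat` classes from the example above (this addresses only dumping; parsing still produces `BasePet` objects unless a discriminator is added):

```python
from pydantic import BaseModel, SerializeAsAny

class BasePet(BaseModel):
    legs: int

class Cat(BasePet):
    meows: float

class Container(BaseModel):
    # SerializeAsAny serializes each item by its runtime type, so
    # subclass-only fields are kept in model_dump_json()
    pets: list[SerializeAsAny[BasePet]]

container = Container(pets=[Cat(legs=4, meows=2.718)])
print(container.model_dump_json())  # {"pets":[{"legs":4,"meows":2.718}]}
```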
I think the LinkML validator would complain that there are additional properties, so LinkML would need to allow that, perhaps with a new key "allow_additional" or "allow_subclasses":

```yaml
Person:
  is_a: Entity
  attributes:
    name:
    events:
      range: Event
      multivalued: true
      allow_additional: true
```

That seems reasonable to me. But I think I still prefer being explicit with a type designator.
-
It would be good to get some fully worked examples around inlining. I've been trying to get the schema and data on https://linkml.io/linkml/schemas/inlining.html#example to validate and cannot.

I believe this to be the schema:

```yaml
id: https://w3id.org/linkml/examples/organism
name: organism-test-model
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  Organism:
    attributes:
      id:
        identifier: true
      name:
        range: string
      has_subtypes:
        range: Organism
        multivalued: true
        inlined: true
```

and this to be the data:

```yaml
id: NCBITaxon:40674
name: mammals
has_subtypes:
  - id: NCBITaxon:9443
    name: primates
    has_subtypes:
      - id: NCBITaxon:9606
        name: humans
      - id: NCBITaxon:9682
        name: cats
```

and the command to be `linkml-validate -s organism-model.yaml organism-data-inline.yaml`, but this is the error that I receive:
-
Given a schema:
Should the following data be considered valid?
Currently it is not. The schema author has the ability to add a type_designator to allow for dynamic designation of type. UPDATE: the generated Pydantic currently follows this pattern; thanks to @rly for the link.
However, there is an argument that this should not be necessary. There is only one valid interpretation of the data. Reasoning can be applied at run time to determine the correct interpretation. For an example of a framework that excels at this kind of thing, see cuelang.
Note the above example is trivial. We can imagine scenarios involving chaining through a series of nested objects looking at range constraints, rules, other kinds of constraints, to arrive at a single correct interpretation.
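To make the run-time reasoning concrete, here is a hypothetical sketch (not a LinkML API) of the simplest form of such inference: match an untyped object's keys against each candidate class's declared slots and accept only an unambiguous match. The slot sets are assumptions modeled on the Person/Organization example in this thread.

```python
# Hypothetical: slots each candidate class declares (assumed for illustration)
CLASS_SLOTS = {
    "Person": {"name", "person_id"},
    "Organization": {"name", "organization_id"},
}

def infer_type(obj: dict) -> str:
    """Return the single class whose slots cover all keys of obj."""
    matches = [cls for cls, slots in CLASS_SLOTS.items() if set(obj) <= slots]
    if len(matches) != 1:
        # zero matches or more than one: no unambiguous interpretation exists
        raise ValueError(f"cannot disambiguate, candidates: {matches}")
    return matches[0]

print(infer_type({"person_id": "P1"}))        # Person
print(infer_type({"organization_id": "O1"}))  # Organization
```

Note that `{"name": "x"}` matches both classes and raises, which is exactly the ambiguity that a real reasoner would try to resolve by chaining through nested objects, ranges, and rules.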
Arguments against:

- … person_id …
- Adding clever magic can obfuscate validation

Perhaps the most LinkML-esque approach here is to be pluralistic: add a schema metaslot `uses_dynamic_type_inference`. If this is True and you try to generate an artefact that does not support this (e.g. current Pydantic classes), a warning or error is raised. But we do not block people who want to try to implement these semantics.