Use of relmodel_transformer for creating data entry forms #2337

cmungall · 2024-10-01T02:16:31Z

cmungall
Oct 1, 2024
Maintainer

Documenting part of a conversation with @ddooley. The context here is the use of DataHarmonizer with nested objects, but this is generally of use to anyone who has need to map from a logical schema that is naturally tree-like (inlined objects) to something more relational/tabular, e.g. for data analysis, data entry (not just RDBMSs).

Assume some data about students and exam results:

students:
  - id: S1
    exam_results:
      - name: MATH
        score: 100
      - name: ENG
        score: 90
  - id: S2
    exam_results:
      - name: MATH
        score: 81
      - name: ENG
        score: 76

Ignore for now how you might render this as a table. The structure here is that one student (identified in a globally unique way by their id) can have zero-to-many exam results.

They can only have one result per exam, so name uniquely identifies the result in the context of any one student.

This is more apparent if we inline as a dict (but this is just a serialization change, it doesn't alter the semantics)

students:
  - id: S1
    exam_results:
      MATH:
        score: 100
      ENG:
        score: 90
  - id: S2
    exam_results:
      MATH:
        score: 81
      ENG:
        score: 76

We would do this by making Student.id an identifier (globally unique) and ExamResult.name a key (unique in the context of its parent object).
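To make the "just a serialization change" point concrete, here is a minimal sketch in plain Python (no LinkML involved) that converts between the list form and the dict form. The helper names `inline_as_dict` and `inline_as_list` are made up for illustration; they are not LinkML functions.

```python
def inline_as_dict(results, key="name"):
    """Re-serialize a list of inlined objects as a dict keyed by `key`."""
    out = {}
    for result in results:
        k = result[key]
        if k in out:
            # A dict cannot hold the same key twice, so the key constraint
            # becomes visible in the serialization itself.
            raise ValueError(f"duplicate key {k!r}")
        out[k] = {f: v for f, v in result.items() if f != key}
    return out


def inline_as_list(results_by_key, key="name"):
    """Inverse of inline_as_dict: put the key back inside each object."""
    return [{key: k, **fields} for k, fields in results_by_key.items()]


as_list = [{"name": "MATH", "score": 100}, {"name": "ENG", "score": 90}]
as_dict = inline_as_dict(as_list)
print(as_dict)                  # {'MATH': {'score': 100}, 'ENG': {'score': 90}}
print(inline_as_list(as_dict))  # round-trips back to as_list
```

Both directions are lossless as long as the key really is unique within the parent.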

Again, ignoring table serialization for now. The above is a common pattern, and it can be done simply with identifier and key. We can forbid structures where a student tries to have MATH twice. We also don't accidentally forbid cases where MATH is repeated for two students. This should hold regardless of serialization.
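As an illustration of what identifier and key mean for the nested data, here is a small hand-rolled check in plain Python. It is only a sketch of the two rules described above, not how a LinkML validator is implemented; `check_uniqueness` is a hypothetical helper.

```python
from collections import Counter


def check_uniqueness(students):
    """Return a list of human-readable violations for nested student data."""
    problems = []
    # `id` is an identifier: it must be unique across all students.
    id_counts = Counter(s["id"] for s in students)
    for sid, n in id_counts.items():
        if n > 1:
            problems.append(f"duplicate student id {sid}")
    # `name` is a key: it only has to be unique within one student.
    for s in students:
        name_counts = Counter(r["name"] for r in s.get("exam_results", []))
        for name, n in name_counts.items():
            if n > 1:
                problems.append(f"student {s['id']} has {name} twice")
    return problems


ok = [
    {"id": "S1", "exam_results": [{"name": "MATH", "score": 100},
                                  {"name": "ENG", "score": 90}]},
    {"id": "S2", "exam_results": [{"name": "MATH", "score": 81}]},
]
bad = [
    {"id": "S1", "exam_results": [{"name": "MATH", "score": 100},
                                  {"name": "MATH", "score": 55}]},
]
print(check_uniqueness(ok))   # []
print(check_uniqueness(bad))  # ['student S1 has MATH twice']
```

MATH appearing under both S1 and S2 is accepted, while MATH appearing twice under S1 is reported.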

The schema for this is straightforward:

classes:
  Student:
    attributes:
      id:
        identifier: true
      full_name:
      # other atts here
      exam_results:
        range: ExamResult
        multivalued: true
        inlined: true
        inlined_as_list: true
  ExamResult:
    attributes:
      name:
        key: true
      score:
        range: integer
      additional_notes:

Now let's consider a tabular serialization. In LinkML there is a relmodel transformer that is used for generating SQL DDL but is completely independent of SQL. It just rewrites schemas such that nesting/inlining and multiple values are replaced by backreferences.

It's easiest to see the transform by looking at the SQL DDL generated from the transformed schema:

CREATE TABLE "student" (
    id TEXT NOT NULL,
    full_name TEXT,
    PRIMARY KEY (id)
);
CREATE TABLE "exam_result" (
    student_id TEXT,
    name TEXT NOT NULL,
    score INTEGER,
    additional_notes TEXT,
    PRIMARY KEY (student_id, name),
    FOREIGN KEY (student_id) REFERENCES "student" (id)
);

Note that in the source schema we don't have any need of "student_id" in ExamResult, it's "owned" by the parent. But in SQL or in DataHarmonizer there is no concept of inlining, so the transform auto-introduces a foreign key "back reference". And we can auto-infer that the tuple of (student_id, name) is unique because (a) name is unique in the context of the owning object (b) student_id uniquely identifies the owning object in a non-inlined context.
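To see the compound key and the back reference doing their job, here is a minimal sketch using Python's built-in sqlite3 module. The DDL is a hand-written simplification of the tables discussed here (not the literal generator output), and `try_insert` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE student (
    id TEXT NOT NULL,
    full_name TEXT,
    PRIMARY KEY (id)
);
CREATE TABLE exam_result (
    student_id TEXT NOT NULL,
    name TEXT NOT NULL,
    score INTEGER,
    PRIMARY KEY (student_id, name),
    FOREIGN KEY (student_id) REFERENCES student (id)
);
""")
conn.executemany("INSERT INTO student (id) VALUES (?)", [("S1",), ("S2",)])
conn.executemany(
    "INSERT INTO exam_result VALUES (?, ?, ?)",
    [("S1", "MATH", 100), ("S2", "MATH", 81)],  # MATH for two students is fine
)


def try_insert(row):
    """Return True if the exam_result row is accepted, False if rejected."""
    try:
        conn.execute("INSERT INTO exam_result VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


dup = try_insert(("S1", "MATH", 55))    # same student, same exam
orphan = try_insert(("S9", "ENG", 70))  # no such student
new = try_insert(("S1", "ENG", 90))     # new exam for an existing student
print(dup, orphan, new)  # False False True
```

The repeated (S1, MATH) pair is rejected by the compound primary key, and the row for the unknown student S9 is rejected by the foreign key.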

Here's the actual linkml output of the relmodel transformer:

ExamResult:
    attributes:
      name:
        key: true
        range: string
        required: true
      score:
        range: integer
      additional_notes:
        range: string
      Student_id:
        annotations:
          backref:
            tag: backref
            value: 'true'
          rdfs:subPropertyOf:
            tag: rdfs:subPropertyOf
            value: rdf:subject
          primary_key:
            tag: primary_key
            value: 'true'
          foreign_key:
            tag: foreign_key
            value: Student.id
        description: Autocreated FK slot
        slot_uri: rdf:subject
        range: Student
    unique_keys:
      Student_name:
        unique_key_slots:
        - Student_id
        - name

Actual records will look like

STUDENT:

id	full_name
S1	...
S2	...

EXAM_RESULT:

Student_id	name	score	...
S1	MATH	100	...
...	...	...	...
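For illustration, the nested records can be flattened into these two tables with a few lines of plain Python. This is a sketch of the idea only, not the LinkML data transform; `flatten` is a hypothetical helper and the column name `Student_id` follows the transformed schema discussed in this post.

```python
def flatten(students, backref="Student_id"):
    """Split nested student objects into student rows and exam_result rows."""
    student_rows, exam_rows = [], []
    for s in students:
        student_rows.append({"id": s["id"], "full_name": s.get("full_name")})
        for result in s.get("exam_results", []):
            # The back reference is copied down from the owning object.
            exam_rows.append({backref: s["id"], **result})
    return student_rows, exam_rows


students = [
    {"id": "S1", "exam_results": [{"name": "MATH", "score": 100},
                                  {"name": "ENG", "score": 90}]},
    {"id": "S2", "exam_results": [{"name": "MATH", "score": 81},
                                  {"name": "ENG", "score": 76}]},
]
student_rows, exam_rows = flatten(students)
for row in exam_rows:
    print(row["Student_id"], row["name"], row["score"], sep="\t")
```

Each exam row now carries its owner's id, so (Student_id, name) is unique across the whole table even though name alone is not.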

Any form-based data entry tool like DH that doesn't have a native concept of a "parent" object could use this transform. Use the relmodel transformer to make a table-friendly version of the schema that has the backrefs and unique keys inserted. The person authoring the source schema only worries about logical constraints. These will get translated correctly to the relational schema. The form tool can just use this directly. It can use the existing compound key mechanism over the transformed schema without need of any esoteric extensions. It can save JSON that is directly conformant with the relational schema, or it can use the standard transform to use the source schema with nesting.
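Going the other way (from rows entered in a form back to the nested source shape) can be sketched like this. Again this is plain Python written for illustration, not the standard LinkML transform; `nest` is a hypothetical helper, and it enforces the compound key while regrouping.

```python
def nest(student_rows, exam_rows, backref="Student_id"):
    """Regroup flat exam_result rows under their owning student."""
    by_id = {s["id"]: {**s, "exam_results": []} for s in student_rows}
    seen = set()
    for row in exam_rows:
        owner = row[backref]
        if owner not in by_id:
            raise ValueError(f"unknown student {owner!r}")
        compound_key = (owner, row["name"])
        if compound_key in seen:
            raise ValueError(f"duplicate key {compound_key!r}")
        seen.add(compound_key)
        # Drop the back reference: in the nested form the parent owns the row.
        result = {k: v for k, v in row.items() if k != backref}
        by_id[owner]["exam_results"].append(result)
    return list(by_id.values())


student_rows = [{"id": "S1"}, {"id": "S2"}]
exam_rows = [
    {"Student_id": "S1", "name": "MATH", "score": 100},
    {"Student_id": "S1", "name": "ENG", "score": 90},
    {"Student_id": "S2", "name": "MATH", "score": 81},
]
nested = nest(student_rows, exam_rows)
print(nested[0]["exam_results"])
```

A form tool would only need the back reference and the compound key from the transformed schema to do this regrouping.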

TODOs:

Check that relmodel_transformer creates a unique compound key as per the above

ddooley · 2024-10-02T15:09:29Z

ddooley
Oct 2, 2024

The DataHarmonizer development crew is looking forward to implementing the above primary_key and foreign_key annotations via relmodel_transformer. I also want to check whether relmodel_transformer handles the annotations needed to describe larger compound keys. Say, in the example above, the same schema also accommodated a table "TestResult" with primary key school_id x student_id (unique within a school but not across schools) x test_id.

1 reply

cmungall Oct 2, 2024
Maintainer Author

Can you say more about what the source linkml schema (pre-transform) would look like?

ddooley · 2024-10-02T16:52:27Z

ddooley
Oct 2, 2024

Well, cutting straight to our use-case, we're trying to manage not just one schema but a number of schemas within one overall meta-schema. Envision say an online library of schemas and their main class, enum, slot attributes. Within DH, mainly trying to offer users the experience of editing one schema, but being able to load others side by side into the same data structure to provide relatively easy copy/paste behaviour (with possibly editing of other schemas within the bundle too, which facilitates publishing them all in a bundle.) Below is just a core slice of the meta schema, focusing just on the unique keys, which reference slots/fields that other tables/classes contain without including them below. This is currently unworkable since in unique_key_slots we can't convey which source class/schema a key slot "comes from", i.e. back references as foreign key. Also it is not yet coded to the relmodel transformer you used above.

classes:
  Schema:
    name: Schema
    description: The top-level description of a LinkML schema. A schema contains classes for describing one or more DataHarmonizer templates, fields/columns, and picklists (which are themselves LinkML classes, slots, and enumerations)
    unique_keys:
      schema_id:
        description: A schema is uniquely identified by its URI
        unique_key_slots:
        - id

  Prefix:
    name: Prefix
    description: A prefix used in the URIs mentioned in this schema.
    unique_keys:
      prefix_id:
        description: A prefix is uniquely identified by its Schema and prefix
        unique_key_slots:
        - schema_id
        - prefix

  Enum:
    name: Enum
    description: One or more enumerations in given schema. An enumeration can be used in the "range" or "any of" attribute of a slot. Each enumeration has a flat list or hierarchy of permitted values.
    unique_keys:
      enum_id:
        description: An enumeration is uniquely identified by the schema it appears in as well as its name.
        unique_key_slots:
        - schema_id
        - name

  PermissibleValue:
    name: PermissibleValue
    description: An enumeration picklist value.
    unique_keys:
      permissible_value_id:
        description: A permissible value is uniquely identified by the schema it appears in as well as its name.
        unique_key_slots:
        - enum_id
        - name

The permissible_value_id key relies on an enum_id, which is itself composed of an Enum class name and a schema_id. I had assumed that a normalized PermissibleValue table needed three separate slots/fields (schema_id, enum_id, and name) to manage this in a flat table/SQL way, since enums from different schemas may have the same Enum.name.
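A small sketch of why the flat PermissibleValue table needs all three fields in its key. This is plain Python with made-up rows: `duplicate_keys` is a hypothetical helper, the schema and enum names are invented, and `enum_id` here is assumed to hold just the enum's name.

```python
def duplicate_keys(rows, key_slots):
    """Return the key tuples that occur more than once for the given slots."""
    seen, duplicates = set(), []
    for row in rows:
        key = tuple(row[slot] for slot in key_slots)
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


# Two schemas that both define an enum called ColorEnum with a value "red".
permissible_values = [
    {"schema_id": "schema_a", "enum_id": "ColorEnum", "name": "red"},
    {"schema_id": "schema_b", "enum_id": "ColorEnum", "name": "red"},
    {"schema_id": "schema_a", "enum_id": "ColorEnum", "name": "blue"},
]

print(duplicate_keys(permissible_values, ("enum_id", "name")))
# [('ColorEnum', 'red')]
print(duplicate_keys(permissible_values, ("schema_id", "enum_id", "name")))
# []
```

With only (enum_id, name) the two "red" rows collide; adding schema_id to the key separates them.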

0 replies

ddooley · 2024-10-04T23:15:15Z

ddooley
Oct 4, 2024

So we've tested out

gen-sqltables --relmodel-output schema_relational.yaml schema.yaml

on a simpler case with two classes, GRDISample and AMRTest, where multiple AMR antibiotic tests are performed on a given GRDISample.

I can see it adds good annotations to the schema for the primary / foreign keys in GRDISample and AMRTest. However, for our DH purposes it goes too far in creating another class in the schema for each slot that has an enumeration in its range. So I think we will selectively add the annotations in the same way that gen-sqltables adds them to schema_relational.yaml, but will not use gen-sqltables directly. That way we avoid all the other class generation, which is not needed for DataHarmonizer rendering and functionality with the enum pick lists.

2 replies

cmungall Oct 5, 2024
Maintainer Author

We can add a simple CLI wrapper for the transformer, bypassing the need for a SQL DDL generator.
I don't think it should be creating a table for each enum. (It's possible that some SQLAlchemy dialects get treated this way, but this doesn't matter as you just care about the transformed schema.) Can you provide an example?

cmungall Oct 5, 2024
Maintainer Author

perhaps these are multivalued fields - in which case the fact the range is an enum isn't relevant. The pure relational model doesn't have any concept of multivalued columns in tables, these are canonically transformed to new tables. But this can be made optional for ranges that are non-classes. Useful for duckdb etc too.
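To illustrate the canonical treatment of a multivalued column, here is a plain-Python sketch that moves the values into a separate link table with a back reference. It only mimics the idea; `split_multivalued`, the slot name `environmental_site`, the column name `GRDISample_id`, and the sample values are all made up for the example.

```python
def split_multivalued(rows, id_slot, multivalued_slot, backref):
    """Move a multivalued slot out of `rows` into one link row per value."""
    parent_rows, link_rows = [], []
    for row in rows:
        parent_rows.append(
            {k: v for k, v in row.items() if k != multivalued_slot}
        )
        for value in row.get(multivalued_slot) or []:
            link_rows.append({backref: row[id_slot], multivalued_slot: value})
    return parent_rows, link_rows


samples = [
    {"id": "X1", "environmental_site": ["farm", "barn"]},
    {"id": "X2", "environmental_site": []},
]
parent_rows, link_rows = split_multivalued(
    samples, "id", "environmental_site", backref="GRDISample_id"
)
print(parent_rows)  # [{'id': 'X1'}, {'id': 'X2'}]
print(link_rows)
```

Whether the values come from an enum or a plain string range makes no difference here; only the multiplicity triggers the extra table.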

ddooley · 2024-10-08T19:53:00Z

ddooley
Oct 8, 2024

Well, for GRDISample_Environmental_Site etc, ah yes they are all multivalued, so I see now that's what is triggering table generation! It would be great if that became an optional feature.

0 replies
