annotations experiment #942

Closed
wants to merge 3 commits into from

sh-rp (Collaborator) commented Feb 7, 2024

Description

This PR has an example of how we could leverage python annotations to have a more universal and more readable format of our schema.

Check out test_pipeline.py to see what the user-facing interface would look like in this example. It actually already works. Basically, all you need to do is put our annotations on class variables, and you can use any new or pre-existing class as the basis for our schema. Type hints that are unknown to us will be ignored (as defined by PEP 593, "Annotated").

Notes and Considerations:

  • Not all hints and data_types are supported here; this is a prototype, but one that is easy to extend.
  • There is a metadata attribute "table" on the class, which we can use to set table-level hints. Currently the dlt core is not set up to support this via the columns attribute, but I think it should be quite easy to do.
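
To make the idea in the description concrete, here is a minimal sketch of reading `Annotated` metadata off class variables with the standard library. The hint classes `PrimaryKey` and `Classifiers` and the helper `collect_hints` are hypothetical stand-ins, not the classes or functions from this PR; metadata the helper does not recognize is simply skipped, as PEP 593 suggests.

```python
from typing import Annotated, Any, Dict, Optional, get_args, get_origin, get_type_hints


class PrimaryKey:
    """Hypothetical marker hint (stand-in for the PR's a.PrimaryKey)."""


class Classifiers:
    """Hypothetical hint carrying a list of classifier strings."""

    def __init__(self, values):
        self.values = list(values)


class Items:
    id: Annotated[str, PrimaryKey]
    name: Annotated[Optional[str], Classifiers(["pii.name"])]
    note: Annotated[str, "metadata from some other tool"]  # unknown to us


def collect_hints(cls: type) -> Dict[str, Dict[str, Any]]:
    """Collect the column hints we understand from a class's annotations."""
    columns: Dict[str, Dict[str, Any]] = {}
    # include_extras=True keeps the Annotated wrapper instead of stripping it
    for name, hint in get_type_hints(cls, include_extras=True).items():
        column: Dict[str, Any] = {}
        if get_origin(hint) is Annotated:
            # the first arg is the wrapped type, the rest is metadata
            for meta in get_args(hint)[1:]:
                if meta is PrimaryKey or isinstance(meta, PrimaryKey):
                    column["primary_key"] = True
                elif isinstance(meta, Classifiers):
                    column["x-classifiers"] = meta.values
                # any other metadata is not ours, so it is ignored
        columns[name] = column
    return columns


print(collect_hints(Items))
```

The sketch only gathers hints; mapping the wrapped Python type to a data_type is left out on purpose.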

netlify bot commented Feb 7, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 7547aa8
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/65c3b66fe5f9280008f8eab6

@@ -65,6 +66,11 @@ def ensure_table_schema_columns(columns: TAnySchemaColumns) -> TTableSchemaColum
        isinstance(columns, pydantic.BaseModel) or issubclass(columns, pydantic.BaseModel)
    ):
        return pydantic.pydantic_to_table_schema_columns(columns)
    elif isinstance(columns, type):

Collaborator Author

no support for table level hints here yet

# run a simple pipeline and see whether the schema was used
load_info = p.run(data, columns=Items, table_name="blah")
print(load_info)
print(p.default_schema.to_pretty_yaml())

Collaborator Author

The items example will produce this segment in the final schema:

  blah:
    columns:
      id:
        data_type: text
        primary_key: true
        unique: true
      name:
        data_type: text
        nullable: true
        x-classifiers:
        - pii.name
      email:
        data_type: text
        nullable: true
        unique: true
        x-classifiers:
        - pii.email
      likes_herring:
        data_type: bool
        x-classifiers:
        - pii.food_preference
      _dlt_load_id:
        data_type: text
        nullable: false
      _dlt_id:
        data_type: text
        nullable: false
        unique: true
    write_disposition: append

def to_full_type(t: Type[Any]) -> TColumnSchema:
    result: TColumnSchema = {}
    if get_origin(t) is Union:
        for arg in get_args(t):

Contributor

In this case the last type in the Union will override all previous values. Should we aggregate things somehow?

Collaborator Author

we can only have one type in the schema, so either we have a default way of resolving if there are multiple types or we throw an error.

Collaborator Author

Or a Union of int and string could resolve to a string, but I think that is taking it too far for now.

Contributor

Understood
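
One way to picture the "throw an error" option from this thread is the sketch below. It assumes a hypothetical helper `resolve_single_type` (not part of this PR) that treats `None` in a Union as nullability and rejects any Union that still has more than one concrete member. Only `typing.Union` and `Optional` are handled here, not the `X | Y` syntax.

```python
from typing import Any, Optional, Tuple, Union, get_args, get_origin


def resolve_single_type(t: Any) -> Tuple[Any, bool]:
    """Return (python_type, nullable) for a plain type or a Union.

    Raises TypeError when the Union contains more than one non-None member,
    because a schema column can hold only one data type.
    """
    if get_origin(t) is not Union:
        return t, False
    args = get_args(t)
    non_none = [a for a in args if a is not type(None)]
    # a None member means the column is nullable, it is not a data type
    nullable = len(non_none) < len(args)
    if len(non_none) != 1:
        raise TypeError(f"cannot pick a single column type from {t!r}")
    return non_none[0], nullable


print(resolve_single_type(Optional[str]))
print(resolve_single_type(int))
```

Raising keeps the behavior explicit; a default resolution rule could later replace the `raise` without changing the return shape.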

TypeMap: Dict[Any, TDataType] = {
    str: "text",

Collaborator

we have py_type_to_sc_type. look at the type_helpers.py. tons of edge cases are handled there when converting types

Collaborator Author

ah yes, great :) I was wondering if we had something like this somewhere.

rudolfix (Collaborator) left a comment

this is soo cool



def unwrap(t: Type[Any]) -> Tuple[Any, List[Any]]:
    """Returns python type info and wrapped types if this was annotated type"""

Collaborator

look at typing.py in common, I think I have similar function

Collaborator Author

yes, i extracted the essential part.
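
For readers without the diff at hand, a function with the `unwrap` signature quoted above could look roughly like this. It is a sketch built only on `typing.get_origin` and `typing.get_args`, not the body from this PR or from dlt's typing.py.

```python
from typing import Annotated, Any, List, Tuple, get_args, get_origin


def unwrap(t: Any) -> Tuple[Any, List[Any]]:
    """Return the wrapped type and its metadata; plain types get an empty list."""
    if get_origin(t) is Annotated:
        inner, *metadata = get_args(t)
        return inner, metadata
    return t, []


# Nested Annotated types are flattened by the typing module itself,
# so a single unwrapping step is enough.
print(unwrap(Annotated[Annotated[int, "a"], "b"]))
print(unwrap(str))
```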

test_pipeline.py Outdated
class Items:

    # metadata for the table, currently not picked up by the pipeline
    __table__: Annotated[Never, a.TableName("my_items"), a.WriteDisposition("merge")]

Collaborator

Hmm, there must be a better way, i.e.:
load_info = p.run(data, columns=Annotated[Items, a.TableName("my_items"), a.WriteDisposition("merge")], table_name="blah")

If Items is not annotated, we fall back to the defaults: Items as the table name and append as the write disposition.

Collaborator

Also, using columns for the above is a kind of abuse... Let's sync on table-level hints. They may require changes to our core library, e.g. making resource a generic that takes the model as T. Also see: #753

Collaborator

or we add "model" to resource definition. but then we are going into a big overhaul of our schema system where relational and python schemas are different.

Collaborator Author

in my view we should rename the columns to "model" or "table" and allow a TableSchema or even multiple table schemas (to cover subtables) in there. If a list of columns is detected we can fall back to the old way.

Collaborator Author

i changed it to accept annotated tables now, so this would work.
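
A rough sketch of how an annotated table model could be resolved to table-level hints, including the defaults suggested earlier in this thread, is shown below. `TableName`, `WriteDisposition`, and `table_hints` are hypothetical names defined here for illustration; they are not the PR's `a.TableName` / `a.WriteDisposition` implementations, and nothing here touches the dlt pipeline.

```python
from dataclasses import dataclass
from typing import Annotated, Any, Dict, get_args, get_origin


@dataclass(frozen=True)
class TableName:
    """Hypothetical table-level hint holding the table name."""
    name: str


@dataclass(frozen=True)
class WriteDisposition:
    """Hypothetical table-level hint holding the write disposition."""
    value: str


class Items:
    id: str


def table_hints(model: Any) -> Dict[str, str]:
    """Resolve table-level hints from a class or an Annotated[class, ...]."""
    metadata = []
    if get_origin(model) is Annotated:
        model, *metadata = get_args(model)
    # defaults when nothing is annotated: class name and "append"
    hints = {"name": model.__name__, "write_disposition": "append"}
    for meta in metadata:
        if isinstance(meta, TableName):
            hints["name"] = meta.name
        elif isinstance(meta, WriteDisposition):
            hints["write_disposition"] = meta.value
    return hints


print(table_hints(Items))
print(table_hints(Annotated[Items, TableName("my_items"), WriteDisposition("merge")]))
```

How an explicit `table_name` argument to `p.run` should interact with a `TableName` hint is not settled in this thread, so the sketch leaves that question open.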

    id: Annotated[str, a.PrimaryKey, a.Unique]

    # additional columns
    name: Annotated[Optional[str], a.Classifiers(["pii.name"])]

Collaborator

I'd probably generate literals for those

Collaborator Author

you mean for the classifiers?
