-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
annotations experiment #942
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
@@ -65,6 +66,11 @@ def ensure_table_schema_columns(columns: TAnySchemaColumns) -> TTableSchemaColum | |||
isinstance(columns, pydantic.BaseModel) or issubclass(columns, pydantic.BaseModel) | |||
): | |||
return pydantic.pydantic_to_table_schema_columns(columns) | |||
elif isinstance(columns, type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no support for table level hints here yet
# run simple pipeline and see wether schema was used | ||
load_info = p.run(data, columns=Items, table_name="blah") | ||
print(load_info) | ||
print(p.default_schema.to_pretty_yaml()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The items example will produce this segment in the final schema:
blah:
columns:
id:
data_type: text
primary_key: true
unique: true
name:
data_type: text
nullable: true
x-classifiers:
- pii.name
email:
data_type: text
nullable: true
unique: true
x-classifiers:
- pii.email
likes_herring:
data_type: bool
x-classifiers:
- pii.food_preference
_dlt_load_id:
data_type: text
nullable: false
_dlt_id:
data_type: text
nullable: false
unique: true
write_disposition: append
def to_full_type(t: Type[Any]) -> TColumnSchema: | ||
result: TColumnSchema = {} | ||
if get_origin(t) is Union: | ||
for arg in get_args(t): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case the last type in Union
will override all previous values should aggregate things somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can only have one type in the schema, so either we have a default way of resolving if there are multiple types or we throw an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or types of int and string will produce a string, but I think that is taking it to far for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood
dlt/common/schema/annotations.py
Outdated
# | ||
|
||
TypeMap: Dict[Any, TDataType] = { | ||
str: "text", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have py_type_to_sc_type. look at the type_helpers.py. tons of edge cases are handled there when converting types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes, great :) I was wondering if we had something like this somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is soo cool
|
||
|
||
def unwrap(t: Type[Any]) -> Tuple[Any, List[Any]]: | ||
"""Returns python type info and wrapped types if this was annotated type""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at typing.py in common, I think I have similar function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i extracted the essential part.
test_pipeline.py
Outdated
class Items: | ||
|
||
# metadata for the table, currently not picked up by the pipeline | ||
__table__: Annotated[Never, a.TableName("my_items"), a.WriteDisposition("merge")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm there must be a better way! ie.
load_info = p.run(data, columns=Annotated[Items, a.TableName("my_items"), a.WriteDisposition("merge")] , table_name="blah")
if items are not annotated then we go to default: Items as name, append and write dispositions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also using columns
for the above is a kind of abuse... let's sync on table level hints. they may require to change our core library. ie. to make resource a generic that takes model as T. also see: #753
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we add "model" to resource definition. but then we are going into a big overhaul of our schema system where relational and python schemas are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in my view we should rename the columns to "model" or "table" and allow a TableSchema or even multiple table schemas (to cover subtables) in there. If a list of columns is detected we can fall back to the old way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i changed it to accept annotated tables now, so this would work.
id: Annotated[str, a.PrimaryKey, a.Unique] | ||
|
||
# additional columns | ||
name: Annotated[Optional[str], a.Classifiers(["pii.name"])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably generate literals for those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean for the classifiers?
allow annotation of table arg
Description
This PR has an example of how we could leverage python annotations to have a more universal and more readable format of our schema.
Check out test_pipeline.py on what the interface for the user would be in this example. It actually already works. Basically all you need to to is to put our annotations on class vars, and you can use any new or pre-existing class as a basis for our schema. Typehints that are unknown to us will be ignored (as defined per "Annotated" PEP)
Notes and Considerations: