-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dx 32 nested types #17
base: master
Are you sure you want to change the base?
Conversation
@@ -43,6 +43,7 @@ def above_threshold(self): | |||
"database": "table_schema", | |||
"table": "table_name", | |||
"column": "column_name", | |||
"type": "data_type", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed the columns from the source SQL query, so you don't need this rename any more
).withColumn("effective_timestamp", func.current_timestamp()) | ||
# merge using scd-typ2 | ||
logger.friendly(f"Update classification table {self.classification_table_name}") | ||
|
||
self.classification_table.alias("target").merge( | ||
staged_updates_df.alias("source"), | ||
"target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.tag_name = source.tag_name AND target.current = true", | ||
"target.table_catalog <=> source.table_catalog AND target.table_schema = source.table_schema AND target.table_name = source.table_name AND target.column_name = source.column_name AND target.data_type = source.data_type AND target.tag_name = source.tag_name AND target.current = true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to match on data_type?
|
||
|
||
@dataclass | ||
class TaggedColumn: | ||
name: str | ||
data_type: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need the full_data_type
, which contains the full definition of the composed columns
@@ -66,6 +66,15 @@ def compile_msql(self, table_info: TableInfo) -> list[SQLRow]: | |||
temp_sql = msql | |||
for tagged_col in tagged_cols: | |||
temp_sql = temp_sql.replace(f"[{tagged_col.tag}]", tagged_col.name) | |||
# TODO: Can we avoid "replacing strings" for the different types in the future? This is due to the generation of MSQL. Maybe we should rather generate SQL directly from the search method... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should do that instead.
col_name_splitted = col_name.split(".") | ||
return ".".join(["`" + col + "`" for col in col_name_splitted]) | ||
|
||
def recursive_flatten_complex_type(self, col_name, schema, column_list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, wow! This was more complex than I expected.
{"col_name": self.backtick_col_name(col_name + "." + field.name), "type": "string"} | ||
) | ||
elif type(field.dataType) in self.COMPLEX_TYPES: | ||
column_list = self.recursive_flatten_complex_type(col_name + "." + field.name, field, column_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should be appending to column_list instead of replacing column_list. Otherwise you overwrite previously appended string types
@@ -23,3 +23,6 @@ hive_metastore,default,tb_all_types,str_part_col,STRING,1 | |||
,default,tb_1,ip,STRING, | |||
,default,tb_1,mac,STRING, | |||
,default,tb_1,description,STRING, | |||
,default,tb_2,active,BOOLEAN, | |||
,default,tb_2,categories,"map<string,string>", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we need to add a "full_data_type" column to reflect the UC structure
A first version supporting complex types (struct with any level and arrays up to the first level only). Structs are scanned recursively but as soon as an array (or map) is found we don't continue for that column.
Still requires some code refactoring but it would be great if you could review the approach.
Steps remaining: