feat(data-warehouse): New pipeline WIP #26341
base: master
Conversation
Looks great. Some nits but this flow is very nice to reason through. I've added a PR with the charts config for new workers in a comment below. Won't approve just yet until you move this out of WIP
raise ValueError(f"No default value defined for type: {pyarrow_type}")

def _update_incrementality(schema: ExternalDataSchema | None, table: pa.Table, logger: FilteringBoundLogger) -> None:
NIT: _update_increment_state?
schema.update_incremental_field_last_value(last_value)

def _update_job_row_count(job_id: str, count: int, logger: FilteringBoundLogger) -> None:
Why is this outside of the Pipeline class?
Have kept nearly all utils outside of the class directly, to keep the class cleaner and make it more about the flow of steps that are undertaken.
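For illustration, a minimal sketch of the layout being described: helpers live at module level, and the pipeline class reads as a sequence of steps. All names here (the helpers, the Pipeline class, its fields) are hypothetical stand-ins, not the PR's actual code.

```python
# Hypothetical sketch: bookkeeping helpers stay at module level so the
# class body is only about the flow of steps.

def _update_increment_state(state: dict, last_value: int) -> None:
    # Pure bookkeeping, no pipeline flow logic.
    state["last_value"] = last_value


def _update_job_row_count(counts: dict, job_id: str, count: int) -> None:
    counts[job_id] = counts.get(job_id, 0) + count


class Pipeline:
    """The class reads as the sequence of steps, with utils kept outside."""

    def __init__(self) -> None:
        self.state: dict = {}
        self.counts: dict = {}

    def run(self, job_id: str, rows: list) -> None:
        _update_job_row_count(self.counts, job_id, len(rows))
        if rows:
            _update_increment_state(self.state, max(rows))
```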
@@ -425,12 +433,18 @@ def _run(
    schema: ExternalDataSchema,
    reset_pipeline: bool,
):
    table_row_counts = DataImportPipelineSync(job_inputs, source, logger, reset_pipeline, schema.is_incremental).run()
    total_rows_synced = sum(table_row_counts.values())
    if settings.DEBUG:
Can base this off the env var instead: settings.TEMPORAL_TASK_QUEUE == "v2-data-warehouse-task-queue"
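A sketch of what the suggestion might look like: branch on the configured task queue name rather than settings.DEBUG. The queue name is taken from the comment above; the helper function and the way settings are read are assumptions, not the PR's code.

```python
# Illustrative sketch: decide whether we are on the v2 pipeline from the
# task queue name (value taken from the review comment) instead of DEBUG.
# is_v2_pipeline is a hypothetical helper.
DATA_WAREHOUSE_TASK_QUEUE_V2 = "v2-data-warehouse-task-queue"


def is_v2_pipeline(task_queue: str) -> bool:
    return task_queue == DATA_WAREHOUSE_TASK_QUEUE_V2
```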
if not primary_keys or len(primary_keys) == 0:
    raise Exception("Primary key required for incremental syncs")

delta_table.merge(
Will the merge function work on an empty delta table? Asking because it'd clean up this logic a bunch if we could just handle if delta_table is None before this entire if block.
Would it work? Unsure, but likely. I'd like to keep this separate though: delta-rs seems to have a bunch of side effects with these different funcs (and different modes etc). Also, we should only be using the merge function on incremental syncs; things like primary keys are required for incremental syncs but are not for full refresh.
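The write-mode distinction being discussed can be sketched as pure logic (function and return values are hypothetical; the real code calls into delta-rs): merge is only used for incremental syncs, which require primary keys, while a full refresh overwrites and does not.

```python
from typing import Optional

# Hypothetical sketch of the decision described above. "merge" and
# "overwrite" stand in for the corresponding delta-rs operations.


def choose_write_mode(is_incremental: bool, primary_keys: Optional[list]) -> str:
    if not is_incremental:
        # Full refresh: primary keys are not required.
        return "overwrite"
    if not primary_keys:
        raise Exception("Primary key required for incremental syncs")
    return "merge"
```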
for column_name in table.column_names:
    column = table.column(column_name)
    if pa.types.is_struct(column.type) or pa.types.is_list(column.type):
        json_column = pa.array([json.dumps(row.as_py()) if row.as_py() is not None else None for row in column])
Out of scope here, but an issue I just discovered that might be addressable here: clickhouse s3 can't deserialize a list like ["test"]
Do you have more context on this? A support ticket maybe?
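The serialization step quoted above can be reduced to a pure-Python sketch: nested (struct/list) values are dumped to JSON strings, with None passed through. In the real code this runs over a pyarrow column; here a plain list stands in for it, and serialize_nested is a hypothetical name.

```python
import json

# Pure-Python stand-in for the column loop above: each nested value is
# serialized to a JSON string, None stays None.


def serialize_nested(values: list) -> list:
    return [json.dumps(v) if v is not None else None for v in values]
```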
@@ -0,0 +1,137 @@
import time
We'll want to test all this well eventually but fine for now
@@ -33,6 +34,7 @@
SYNC_BATCH_EXPORTS_TASK_QUEUE: BATCH_EXPORTS_ACTIVITIES,
BATCH_EXPORTS_TASK_QUEUE: BATCH_EXPORTS_ACTIVITIES,
DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
Add to the workflows dict above too
Ah oops, totally missed this - fixed, thank you!
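The fix being described, sketched: a new task queue has to be registered in both the workflows mapping and the activities mapping. The queue constants mirror the diff; the workflow and activity values below are placeholder strings, not PostHog's actual registrations.

```python
# Illustrative sketch: registering a new task queue in BOTH mappings.
# Values are placeholders; only the shape of the fix is the point.
DATA_WAREHOUSE_TASK_QUEUE = "data-warehouse-task-queue"
DATA_WAREHOUSE_TASK_QUEUE_V2 = "v2-data-warehouse-task-queue"

DATA_SYNC_ACTIVITIES = ["import_data_activity"]        # placeholder
DATA_MODELING_ACTIVITIES = ["run_dag_activity"]        # placeholder
DATA_WAREHOUSE_WORKFLOWS = ["ExternalDataJobWorkflow"]  # placeholder

WORKFLOWS_DICT = {
    DATA_WAREHOUSE_TASK_QUEUE: DATA_WAREHOUSE_WORKFLOWS,
    # The easy-to-miss line: the v2 queue needs a workflows entry too.
    DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_WAREHOUSE_WORKFLOWS,
}

ACTIVITIES_DICT = {
    DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
    DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
}
```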
Problem
Changes
Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
Unit tests and local running