
feat(data-warehouse): New pipeline WIP #26341

Open · wants to merge 19 commits into master

Conversation

@Gilbert09 (Member) commented Nov 21, 2024

Problem

  • We'd like more control over our DWH import pipeline; we're currently seeing high memory usage and a high failure rate with the existing pipeline

Changes

  • Rebuild the pipeline without using DLT
  • Set up the temporal workflow, "external-data-job", to run both versions of the pipeline concurrently so we can get side-by-side results and test the stability of the new pipeline (see the dispatch sketch after this list)
    • This means jobs from the new pipeline are excluded from billing and don't show up on the sync page when users manage a source
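A minimal sketch of how that side-by-side dispatch could look, assuming the new pipeline lives behind the separate Temporal task queue referenced later in this review ("v2-data-warehouse-task-queue"); the function and workflow-ID scheme are illustrative, not this PR's actual API:

```python
from temporalio.client import Client

# Illustrative sketch only: start the same workflow once per task queue and
# let each worker pool execute its own pipeline implementation (v1 vs. v2).
async def trigger_side_by_side(client: Client, inputs: dict, job_id: str) -> None:
    for queue in ("data-warehouse-task-queue", "v2-data-warehouse-task-queue"):
        await client.start_workflow(
            "external-data-job",     # workflow name from the PR description
            inputs,
            id=f"{job_id}-{queue}",  # distinct IDs so both runs can coexist
            task_queue=queue,
        )
```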

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Unit tests and local runs

@EDsCODE self-requested a review November 21, 2024 18:29
@EDsCODE (Member) left a comment:

Looks great. Some nits, but this flow is very nice to reason through. I've added a PR with the charts config for the new workers in a comment below. Won't approve just yet until you move this out of WIP.

raise ValueError(f"No default value defined for type: {pyarrow_type}")


def _update_incrementality(schema: ExternalDataSchema | None, table: pa.Table, logger: FilteringBoundLogger) -> None:
@EDsCODE (Member):

NIT: _update_increment_state?

schema.update_incremental_field_last_value(last_value)
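Piecing the two context lines together, a hedged sketch of what _update_incrementality plausibly does; the incremental_field attribute name is an assumption, while is_incremental and update_incremental_field_last_value appear elsewhere in this diff:

```python
import pyarrow as pa
import pyarrow.compute as pc
from structlog.typing import FilteringBoundLogger

def _update_incrementality(schema, table: pa.Table, logger: FilteringBoundLogger) -> None:
    # Assumed behaviour: after importing a batch, persist the highest value of
    # the incremental cursor field so the next sync resumes from there.
    if schema is None or not schema.is_incremental:
        return

    # incremental_field is an assumed attribute name
    last_value = pc.max(table.column(schema.incremental_field)).as_py()
    logger.debug(f"Updating incremental last value to {last_value}")
    schema.update_incremental_field_last_value(last_value)
```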


def _update_job_row_count(job_id: str, count: int, logger: FilteringBoundLogger) -> None:
@EDsCODE (Member):

Why is this outside of the Pipeline class?

@Gilbert09 (Member Author):

I've kept nearly all utils outside of the class to keep the class cleaner and make it more about the flow of steps that are undertaken
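For illustration, a hedged sketch of what such a module-level helper might look like; the ExternalDataJob model, its import path, and the rows_synced field are assumptions, not confirmed by this diff:

```python
from django.db.models import F
from structlog.typing import FilteringBoundLogger

def _update_job_row_count(job_id: str, count: int, logger: FilteringBoundLogger) -> None:
    # Assumed model/field names: bump the job's synced-row counter atomically.
    from posthog.warehouse.models import ExternalDataJob  # assumed import path

    logger.debug(f"Adding {count} rows to job {job_id} row count")
    ExternalDataJob.objects.filter(id=job_id).update(rows_synced=F("rows_synced") + count)
```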

@@ -425,12 +433,18 @@ def _run(
schema: ExternalDataSchema,
reset_pipeline: bool,
):
table_row_counts = DataImportPipelineSync(job_inputs, source, logger, reset_pipeline, schema.is_incremental).run()
total_rows_synced = sum(table_row_counts.values())
if settings.DEBUG:
@EDsCODE (Member):

Can base this off the env var settings.TEMPORAL_TASK_QUEUE = v2-data-warehouse-task-queue

https://github.com/PostHog/charts/pull/2389
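In code, the suggested check would look something like this (a sketch; the constant name matches the diff further down, and its value comes from the comment above):

```python
from django.conf import settings

DATA_WAREHOUSE_TASK_QUEUE_V2 = "v2-data-warehouse-task-queue"

# Gate v2-only behaviour on which task queue this worker consumes,
# rather than on settings.DEBUG.
if settings.TEMPORAL_TASK_QUEUE == DATA_WAREHOUSE_TASK_QUEUE_V2:
    ...  # v2 pipeline path
```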

if not primary_keys or len(primary_keys) == 0:
raise Exception("Primary key required for incremental syncs")

delta_table.merge(
@EDsCODE (Member):

Will this function work on an empty delta table? Asking because it'd clean up this logic a bunch if we could just handle the delta_table is None case before this entire if block

@Gilbert09 (Member Author):

Would it work? Unsure, but likely. I'd like to keep this separate, though: delta-rs seems to have a bunch of side effects across these different funcs (and different modes, etc.). Also, we should only be using the merge function on incremental syncs; things like primary keys are required for incremental syncs but not for full refresh
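For reference, a hedged sketch of the incremental upsert path being discussed, using the deltalake (delta-rs) Python bindings; the wrapper function and predicate construction are illustrative, not the PR's exact code:

```python
import pyarrow as pa
from deltalake import DeltaTable

def merge_incremental(delta_table: DeltaTable, table: pa.Table, primary_keys: list[str]) -> None:
    if not primary_keys:
        raise Exception("Primary key required for incremental syncs")

    # Upsert: update rows whose primary keys match, insert the rest.
    predicate = " AND ".join(f"source.{pk} = target.{pk}" for pk in primary_keys)
    (
        delta_table.merge(
            source=table,
            predicate=predicate,
            source_alias="source",
            target_alias="target",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )
```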

for column_name in table.column_names:
column = table.column(column_name)
if pa.types.is_struct(column.type) or pa.types.is_list(column.type):
json_column = pa.array([json.dumps(row.as_py()) if row.as_py() is not None else None for row in column])
@EDsCODE (Member):

Out of scope here, but an issue I just discovered that might be addressable here: ClickHouse's S3 reads can't deserialize a list like ["test"]

@Gilbert09 (Member Author):

Do you have more context on this? A support ticket, maybe?
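For context, here is the snippet above completed into a self-contained sketch; the function name and the set_column write-back step are assumptions about how the converted column is put back into the table:

```python
import json

import pyarrow as pa

def _stringify_complex_columns(table: pa.Table) -> pa.Table:
    # Serialise struct- and list-typed columns to JSON strings so downstream
    # readers (e.g. ClickHouse over S3) only ever see plain strings.
    for column_name in table.column_names:
        column = table.column(column_name)
        if pa.types.is_struct(column.type) or pa.types.is_list(column.type):
            json_column = pa.array(
                [json.dumps(row.as_py()) if row.as_py() is not None else None for row in column]
            )
            table = table.set_column(table.column_names.index(column_name), column_name, json_column)
    return table
```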


@EDsCODE self-requested a review November 27, 2024 17:50
@Gilbert09 marked this pull request as ready for review November 27, 2024 18:50
posthog/constants.py (resolved)
posthog/temporal/data_imports/external_data_job.py (outdated, resolved)
@@ -0,0 +1,137 @@
import time
@EDsCODE (Member):

We'll want to test all of this thoroughly eventually, but it's fine for now

@@ -33,6 +34,7 @@
SYNC_BATCH_EXPORTS_TASK_QUEUE: BATCH_EXPORTS_ACTIVITIES,
BATCH_EXPORTS_TASK_QUEUE: BATCH_EXPORTS_ACTIVITIES,
DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
@EDsCODE (Member):

Add to the workflows dict above too

@Gilbert09 (Member Author):

Ah oops, totally missed this - fixed, thank you!
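The fix amounts to mirroring the v2 entry in both mappings; a hedged sketch (the dict and workflow-list names are assumptions extrapolated from the diff context above):

```python
# Assumed names: register the v2 queue alongside v1 in both mappings so a
# worker started against v2-data-warehouse-task-queue picks up the same
# workflows and activities as the v1 queue.
WORKFLOWS_DICT = {
    # ...existing entries...
    DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_WORKFLOWS + DATA_MODELING_WORKFLOWS,
    DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_WORKFLOWS + DATA_MODELING_WORKFLOWS,
}
ACTIVITIES_DICT = {
    # ...existing entries...
    DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
    DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
}
```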
