
feat(data-warehouse): New pipeline WIP #26341

Open · wants to merge 19 commits into master

Conversation

@Gilbert09 (Member) commented Nov 21, 2024

Problem

  • We'd like more control over our DWH import pipeline; we're currently seeing high memory usage and a high failure rate with the existing pipeline

Changes

  • Rebuild the pipeline without using DLT
  • Set up the temporal workflow, "external-data-job", to run both versions of the pipeline concurrently so we can get side-by-side results and test the stability of the new pipeline (see the dispatch sketch after this list)
    • This means jobs from the new pipeline are excluded from billing and don't show up on the sync page when users manage a source
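A minimal sketch of how that side-by-side dispatch could look, assuming the new pipeline lives behind the separate Temporal task queue referenced later in this review ("v2-data-warehouse-task-queue"); the function and workflow-ID scheme are illustrative, not this PR's actual API:

```python
from temporalio.client import Client

# Illustrative sketch only: start the same workflow once per task queue and
# let each worker pool execute its own pipeline implementation (v1 vs. v2).
async def trigger_side_by_side(client: Client, inputs: dict, job_id: str) -> None:
    for queue in ("data-warehouse-task-queue", "v2-data-warehouse-task-queue"):
        await client.start_workflow(
            "external-data-job",     # workflow name from the PR description
            inputs,
            id=f"{job_id}-{queue}",  # distinct IDs so both runs can coexist
            task_queue=queue,
        )
```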

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Unit tests and local runs

@EDsCODE self-requested a review November 21, 2024 18:29
@EDsCODE (Member) left a comment:

Looks great. Some nits, but this flow is very nice to reason through. I've added a PR with the charts config for the new workers in a comment below. Won't approve just yet until you move this out of WIP.

raise ValueError(f"No default value defined for type: {pyarrow_type}")


def _update_incrementality(schema: ExternalDataSchema | None, table: pa.Table, logger: FilteringBoundLogger) -> None:
@EDsCODE (Member):

NIT: _update_increment_state?

schema.update_incremental_field_last_value(last_value)
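Piecing the two context lines together, a hedged sketch of what _update_incrementality plausibly does; the incremental_field attribute name is an assumption, while is_incremental and update_incremental_field_last_value appear elsewhere in this diff:

```python
import pyarrow as pa
import pyarrow.compute as pc
from structlog.typing import FilteringBoundLogger

def _update_incrementality(schema, table: pa.Table, logger: FilteringBoundLogger) -> None:
    # Assumed behaviour: after importing a batch, persist the highest value of
    # the incremental cursor field so the next sync resumes from there.
    if schema is None or not schema.is_incremental:
        return

    # incremental_field is an assumed attribute name
    last_value = pc.max(table.column(schema.incremental_field)).as_py()
    logger.debug(f"Updating incremental last value to {last_value}")
    schema.update_incremental_field_last_value(last_value)
```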


def _update_job_row_count(job_id: str, count: int, logger: FilteringBoundLogger) -> None:
@EDsCODE (Member):

Why is this outside of the Pipeline class?

@Gilbert09 (Member Author):

I've kept nearly all utils outside of the class to keep the class cleaner and make it more about the flow of steps that are undertaken
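For illustration, a hedged sketch of what such a module-level helper might look like; the ExternalDataJob model, its import path, and the rows_synced field are assumptions, not confirmed by this diff:

```python
from django.db.models import F
from structlog.typing import FilteringBoundLogger

def _update_job_row_count(job_id: str, count: int, logger: FilteringBoundLogger) -> None:
    # Assumed model/field names: bump the job's synced-row counter atomically.
    from posthog.warehouse.models import ExternalDataJob  # assumed import path

    logger.debug(f"Adding {count} rows to job {job_id} row count")
    ExternalDataJob.objects.filter(id=job_id).update(rows_synced=F("rows_synced") + count)
```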

@@ -425,12 +433,18 @@ def _run(
schema: ExternalDataSchema,
reset_pipeline: bool,
):
table_row_counts = DataImportPipelineSync(job_inputs, source, logger, reset_pipeline, schema.is_incremental).run()
total_rows_synced = sum(table_row_counts.values())
if settings.DEBUG:
@EDsCODE (Member):

Can base this off the env var settings.TEMPORAL_TASK_QUEUE = v2-data-warehouse-task-queue

https://github.com/PostHog/charts/pull/2389
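In code, the suggested check would look something like this (a sketch; the constant name matches the diff further down, and its value comes from the comment above):

```python
from django.conf import settings

DATA_WAREHOUSE_TASK_QUEUE_V2 = "v2-data-warehouse-task-queue"

# Gate v2-only behaviour on which task queue this worker consumes,
# rather than on settings.DEBUG.
if settings.TEMPORAL_TASK_QUEUE == DATA_WAREHOUSE_TASK_QUEUE_V2:
    ...  # v2 pipeline path
```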

if not primary_keys or len(primary_keys) == 0:
raise Exception("Primary key required for incremental syncs")

delta_table.merge(
@EDsCODE (Member):

Will this function work on an empty delta table? Asking because it'd clean up this logic a bunch if we could just handle the delta_table is None case before this entire if block

@Gilbert09 (Member Author):

Would it work? Unsure, but likely. I'd like to keep this separate, though: delta-rs seems to have a bunch of side effects across these different funcs (and different modes, etc.). Also, we should only be using the merge function on incremental syncs; things like primary keys are required for incremental syncs but not for full refresh
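For reference, a hedged sketch of the incremental upsert path being discussed, using the deltalake (delta-rs) Python bindings; the wrapper function and predicate construction are illustrative, not the PR's exact code:

```python
import pyarrow as pa
from deltalake import DeltaTable

def merge_incremental(delta_table: DeltaTable, table: pa.Table, primary_keys: list[str]) -> None:
    if not primary_keys:
        raise Exception("Primary key required for incremental syncs")

    # Upsert: update rows whose primary keys match, insert the rest.
    predicate = " AND ".join(f"source.{pk} = target.{pk}" for pk in primary_keys)
    (
        delta_table.merge(
            source=table,
            predicate=predicate,
            source_alias="source",
            target_alias="target",
        )
        .when_matched_update_all()
        .when_not_matched_insert_all()
        .execute()
    )
```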

for column_name in table.column_names:
column = table.column(column_name)
if pa.types.is_struct(column.type) or pa.types.is_list(column.type):
json_column = pa.array([json.dumps(row.as_py()) if row.as_py() is not None else None for row in column])
@EDsCODE (Member):

Out of scope here, but an issue I just discovered that might be addressable here: ClickHouse's S3 reads can't deserialize a list like ["test"]

@Gilbert09 (Member Author):

Do you have more context on this? A support ticket, maybe?
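For context, here is the snippet above completed into a self-contained sketch; the function name and the set_column write-back step are assumptions about how the converted column is put back into the table:

```python
import json

import pyarrow as pa

def _stringify_complex_columns(table: pa.Table) -> pa.Table:
    # Serialise struct- and list-typed columns to JSON strings so downstream
    # readers (e.g. ClickHouse over S3) only ever see plain strings.
    for column_name in table.column_names:
        column = table.column(column_name)
        if pa.types.is_struct(column.type) or pa.types.is_list(column.type):
            json_column = pa.array(
                [json.dumps(row.as_py()) if row.as_py() is not None else None for row in column]
            )
            table = table.set_column(table.column_names.index(column_name), column_name, json_column)
    return table
```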


@EDsCODE self-requested a review November 27, 2024 17:50
@Gilbert09 marked this pull request as ready for review November 27, 2024 18:50
posthog/constants.py (resolved)
posthog/temporal/data_imports/external_data_job.py (outdated, resolved)
@@ -0,0 +1,137 @@
import time
@EDsCODE (Member):

We'll want to test all of this thoroughly eventually, but it's fine for now

@@ -33,6 +34,7 @@
SYNC_BATCH_EXPORTS_TASK_QUEUE: BATCH_EXPORTS_ACTIVITIES,
BATCH_EXPORTS_TASK_QUEUE: BATCH_EXPORTS_ACTIVITIES,
DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
@EDsCODE (Member):

Add to the workflows dict above too

@Gilbert09 (Member Author):

Ah oops, totally missed this - fixed, thank you!
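The fix amounts to mirroring the v2 entry in both mappings; a hedged sketch (the dict and workflow-list names are assumptions extrapolated from the diff context above):

```python
# Assumed names: register the v2 queue alongside v1 in both mappings so a
# worker started against v2-data-warehouse-task-queue picks up the same
# workflows and activities as the v1 queue.
WORKFLOWS_DICT = {
    # ...existing entries...
    DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_WORKFLOWS + DATA_MODELING_WORKFLOWS,
    DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_WORKFLOWS + DATA_MODELING_WORKFLOWS,
}
ACTIVITIES_DICT = {
    # ...existing entries...
    DATA_WAREHOUSE_TASK_QUEUE: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
    DATA_WAREHOUSE_TASK_QUEUE_V2: DATA_SYNC_ACTIVITIES + DATA_MODELING_ACTIVITIES,
}
```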
