
Refactor LoadInfo metrics layout schema #1046

Closed
wants to merge 7 commits into from

Conversation

@sultaniman (Contributor) commented Mar 4, 2024

This issue was reported by a community member in Slack; see related issue #1043.

When we capture load_info data in the destination database the following occurs:

  • Pipeline execution slows down significantly after x number of incremental pipeline runs (it is back to the original speed when load_info is not captured in the database). My specific job (sourcing data from rest API capturing it in MotherDuck) slows down from 1.5 minutes to over 10 minutes (after a few hundred runs) and seems to keep getting slower.
  • Each pipeline run creates a new table with one record (at least in my simple pipeline) with names like _load_info__metrics___1709356777_4431682.

TODO

  • Adjust _ExtractInfo.metrics from Dict[str, List[ExtractMetrics]] to just List[ExtractMetrics],
  • Add a load_id field to StepMetrics,
  • Adjust dependent code to use the flat metrics collection and to look up load ids from it.
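The intended layout change can be sketched in plain Python. The field names besides `load_id` are illustrative assumptions, not the exact dlt definitions:

```python
from typing import List, TypedDict

# Old layout kept metrics keyed by load_id: Dict[str, List[ExtractMetrics]].
# New layout: a flat list where each entry carries its own load_id.

class StepMetrics(TypedDict):
    """Metrics for a particular package processed in a particular pipeline step."""
    load_id: str        # new field: identifies the load package directly
    started_at: float   # illustrative timing fields (assumed, not the dlt schema)
    finished_at: float

metrics: List[StepMetrics] = [
    {"load_id": "1709356777.4431682", "started_at": 0.0, "finished_at": 1.5},
]
```

Because every entry names its load package, repeated runs can append rows to one stable table instead of spawning per-run tables like `_load_info__metrics___1709356777_4431682`.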

@sultaniman sultaniman added bug Something isn't working community This issue came from slack community workspace labels Mar 4, 2024
@sultaniman sultaniman requested review from sh-rp, rudolfix and z3z1ma March 4, 2024 13:19
@sultaniman sultaniman self-assigned this Mar 4, 2024
netlify bot commented Mar 4, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: f428b25
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65e5e3eb97ffab00081467e7


load_info = pipeline.run(data, table_name="users")

pipeline.run([load_info], table_name="_load_info")
Collaborator

  1. Please also add load_info, normalize_info and extract_info.
  2. Please load to a separate schema (you have the schema argument to run()).
  3. Use this second schema to compare hashes (mind that it won't be the default).

@@ -61,8 +61,12 @@ class _StepInfo(NamedTuple):
class StepMetrics(TypedDict):
"""Metrics for particular package processed in particular pipeline step"""

load_id: str
Collaborator

this is OK and overall even better than a dictionary. But in general the shape of the data is changed in asdict() like here:

def asdict(self) -> DictStrAny:
    # to be mixed with NamedTuple
    d: DictStrAny = self._asdict()  # type: ignore
    d["pipeline"] = {"pipeline_name": self.pipeline.pipeline_name}
    d["load_packages"] = [package.asdict() for package in self.load_packages]
    if self.metrics:
        d["started_at"] = self.started_at
        d["finished_at"] = self.finished_at
    return d

and the problem was that we didn't reformat the metrics to convert them from dict to list.
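A minimal sketch of the missing reshaping step, in pure Python. The helper name `metrics_to_list` is hypothetical, not part of dlt:

```python
from typing import Any, Dict, List

def metrics_to_list(metrics: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Flatten a load_id-keyed metrics dict into a list, tagging each entry with its load_id."""
    return [
        {"load_id": load_id, **entry}
        for load_id, entries in metrics.items()
        for entry in entries
    ]

# Example: two load packages, one metrics entry each
flat = metrics_to_list({
    "1709356777.4431682": [{"started_at": 0.0, "finished_at": 1.5}],
    "1709356900.0000001": [{"started_at": 0.0, "finished_at": 2.0}],
})
```

Once each row carries a `load_id` column, the destination table shape is stable across runs.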

dataset_name="mydata",
)

load_info = pipeline.run(data, table_name="users")
Collaborator

please load something more complicated than this, i.e. a source with a resource that has several hints

pipeline.run([load_info], table_name="_load_info")
first_version_hash = pipeline.default_schema.version_hash

load_info = pipeline.run(data, table_name="users")
Collaborator

here let's load again, but we should add another source with a different resource and some schema hints

first_version_hash = pipeline.default_schema.version_hash

load_info = pipeline.run(data, table_name="users")
pipeline.run([load_info], table_name="_load_info")
Collaborator

you may get a schema difference when loading extract_info if the new resource has a new hint type; then indeed we may add a column dynamically.

@sultaniman sultaniman closed this Mar 5, 2024
@sultaniman sultaniman deleted the issue-1043 branch March 5, 2024 12:27
@sultaniman (Contributor, Author) commented:

Closing in favor of #1051.
