
Refactor LoadInfo metrics layout schema #1046

Closed
wants to merge 7 commits into from

Conversation

@sultaniman (Contributor) commented Mar 4, 2024

This issue was reported by a community member in Slack; see related issue #1043.

When we capture load_info data in the destination database the following occurs:

  • Pipeline execution slows down significantly after x number of incremental pipeline runs (it is back to the original speed when load_info is not captured in the database). My specific job (sourcing data from rest API capturing it in MotherDuck) slows down from 1.5 minutes to over 10 minutes (after a few hundred runs) and seems to keep getting slower.
  • Each pipeline run creates a new table with one record (at least in my simple pipeline) with names like _load_info__metrics___1709356777_4431682.

TODO

  • Adjust _ExtractInfo.metrics from Dict[str, List[ExtractMetrics]] to just List[ExtractMetrics],
  • Add a load_id field to StepMetrics,
  • Adjust dependent code to use the flat metrics collection and to look up load ids from it.
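The intended layout change can be sketched in plain Python. The field names besides `load_id` are illustrative assumptions, not the exact dlt definitions:

```python
from typing import List, TypedDict

# Old layout kept metrics keyed by load_id: Dict[str, List[ExtractMetrics]].
# New layout: a flat list where each entry carries its own load_id.

class StepMetrics(TypedDict):
    """Metrics for a particular package processed in a particular pipeline step."""
    load_id: str        # new field: identifies the load package directly
    started_at: float   # illustrative timing fields (assumed, not the dlt schema)
    finished_at: float

metrics: List[StepMetrics] = [
    {"load_id": "1709356777.4431682", "started_at": 0.0, "finished_at": 1.5},
]
```

Because every entry names its load package, repeated runs can append rows to one stable table instead of spawning per-run tables like `_load_info__metrics___1709356777_4431682`.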

@sultaniman sultaniman added bug Something isn't working community This issue came from slack community workspace labels Mar 4, 2024
@sultaniman sultaniman requested review from sh-rp, rudolfix and z3z1ma March 4, 2024 13:19
@sultaniman sultaniman self-assigned this Mar 4, 2024
netlify bot commented Mar 4, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: f428b25
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65e5e3eb97ffab00081467e7


load_info = pipeline.run(data, table_name="users")

pipeline.run([load_info], table_name="_load_info")
Collaborator

  1. Please also add load_info, normalize_info and extract_info.
  2. Please load to a separate schema (you have the schema argument to run()).
  3. Use this second schema to compare hashes (mind that it won't be the default).

@@ -61,8 +61,12 @@ class _StepInfo(NamedTuple):
class StepMetrics(TypedDict):
"""Metrics for particular package processed in particular pipeline step"""

load_id: str
Collaborator

this is OK and overall even better than a dictionary. But in general the shape of the data is changed in asdict() like here:

def asdict(self) -> DictStrAny:
    # to be mixed with NamedTuple
    d: DictStrAny = self._asdict()  # type: ignore
    d["pipeline"] = {"pipeline_name": self.pipeline.pipeline_name}
    d["load_packages"] = [package.asdict() for package in self.load_packages]
    if self.metrics:
        d["started_at"] = self.started_at
        d["finished_at"] = self.finished_at
    return d

and the problem was that we didn't reformat the metrics to convert them from dict to list.
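A minimal sketch of the missing reshaping step, in pure Python. The helper name `metrics_to_list` is hypothetical, not part of dlt:

```python
from typing import Any, Dict, List

def metrics_to_list(metrics: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Flatten a load_id-keyed metrics dict into a list, tagging each entry with its load_id."""
    return [
        {"load_id": load_id, **entry}
        for load_id, entries in metrics.items()
        for entry in entries
    ]

# Example: two load packages, one metrics entry each
flat = metrics_to_list({
    "1709356777.4431682": [{"started_at": 0.0, "finished_at": 1.5}],
    "1709356900.0000001": [{"started_at": 0.0, "finished_at": 2.0}],
})
```

Once each row carries a `load_id` column, the destination table shape is stable across runs.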

dataset_name="mydata",
)

load_info = pipeline.run(data, table_name="users")
Collaborator

please load something more complicated than this, i.e. a source with a resource that has several hints

pipeline.run([load_info], table_name="_load_info")
first_version_hash = pipeline.default_schema.version_hash

load_info = pipeline.run(data, table_name="users")
Collaborator

here let's load again, but we should add another source with a different resource and some schema hints

first_version_hash = pipeline.default_schema.version_hash

load_info = pipeline.run(data, table_name="users")
pipeline.run([load_info], table_name="_load_info")
Collaborator

you may get a schema difference when loading extract_info if the new resource has a new hint type; then indeed we may add a column dynamically.

@sultaniman sultaniman closed this Mar 5, 2024
@sultaniman sultaniman deleted the issue-1043 branch March 5, 2024 12:27
@sultaniman (Contributor, Author) commented:

Closing in favor of #1051.
