Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick fix to serialize load metrics as list instead of a dictionary #1051

Merged
merged 22 commits into from
Mar 5, 2024

Conversation

sultaniman
Copy link
Contributor

@sultaniman sultaniman commented Mar 5, 2024

This issue was reported by a community member in slack and related issue #1043

When we capture load_info data in the destination database the following occurs:

  • Pipeline execution slows down significantly after x number of incremental pipeline runs (it is back to the original speed when load_info is not captured in the database). My specific job (sourcing data from rest API capturing it in MotherDuck) slows down from 1.5 minutes to over 10 minutes (after a few hundred runs) and seems to keep getting slower.
  • Each pipeline run creates a new table with one record (at least in my simple pipeline) with names like _load_info__metrics___1709356777_4431682.

TODO

  • Create 2 separate sources to load
    1. One has resource w/ primary key
    2. Another with two resources
    3. load both sources
  • load the first source
  • load extract, normalize and load info
  • run pipeline with schema parameter

@sultaniman sultaniman requested review from sh-rp and rudolfix March 5, 2024 11:00
@sultaniman sultaniman self-assigned this Mar 5, 2024
Copy link

netlify bot commented Mar 5, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 4839541
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/65e772b45d08cc00086460d4

@sultaniman sultaniman marked this pull request as ready for review March 5, 2024 14:29
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We do not need another test file. Please add it to pipelines tests.
  2. we do not use faker (yet). please revert lock and pyproject for now
  3. why those primary keys are added to run? you have them in the resources

dlt/common/configuration/providers/airflow.py Outdated Show resolved Hide resolved
tests/pipeline/test_pipeline_load_info.py Outdated Show resolved Hide resolved
tests/pipeline/test_pipeline_load_info.py Outdated Show resolved Hide resolved
tests/pipeline/test_pipeline_load_info.py Outdated Show resolved Hide resolved
tests/pipeline/test_pipeline_load_info.py Outdated Show resolved Hide resolved
@sultaniman sultaniman requested a review from rudolfix March 5, 2024 17:16
rudolfix
rudolfix previously approved these changes Mar 5, 2024
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now,

@sultaniman

  1. there were bugs in test (loading tables into trace schema)
  2. there was an inconsistent way to replace schema content (bug got fixed)

@rudolfix rudolfix force-pushed the quickfix-issue-1043 branch from 6a7fc08 to 4839541 Compare March 5, 2024 19:29
@rudolfix rudolfix merged commit 23d5222 into devel Mar 5, 2024
40 of 41 checks passed
@rudolfix rudolfix deleted the quickfix-issue-1043 branch March 5, 2024 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants