Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose staging tables truncation to config #1717

Merged
merged 6 commits into from
Aug 28, 2024

Conversation

VioletM
Copy link
Contributor

@VioletM VioletM commented Aug 22, 2024

No description provided.

@VioletM VioletM requested a review from sh-rp August 22, 2024 07:45
Copy link

netlify bot commented Aug 22, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit e65f853
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66cee588ece17c00085a88d6
😎 Deploy Preview https://deploy-preview-1717--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good already!

@VioletM VioletM force-pushed the feat/1603-prevent-files-removal-on-staging branch from da24fa4 to 82b1a06 Compare August 23, 2024 18:38
@VioletM VioletM linked an issue Aug 26, 2024 that may be closed by this pull request
@VioletM VioletM force-pushed the feat/1603-prevent-files-removal-on-staging branch 6 times, most recently from 63927de to 14b74ec Compare August 26, 2024 13:31
@VioletM VioletM marked this pull request as ready for review August 26, 2024 15:02
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a proper test (look at test_stage_loading.py):

  1. make sure that staging destination is not truncated (flag at default) and truncated (flag to false). at the same time make sure that destination tables are NOT truncated
  2. exclude Athena if not force iceberg (or make sure that staging destination was not truncated)
  3. use some kind of simple dataset (not github - takes a lot of time)

@VioletM feel free to push this task to @rudolfix or @sh-rp or someone else...

dlt/common/destination/reference.py Outdated Show resolved Hide resolved
dlt/common/destination/reference.py Outdated Show resolved Hide resolved
dlt/destinations/impl/athena/athena.py Outdated Show resolved Hide resolved
dlt/destinations/impl/athena/athena.py Outdated Show resolved Hide resolved
dlt/destinations/impl/bigquery/bigquery.py Outdated Show resolved Hide resolved
dlt/destinations/impl/dummy/configuration.py Show resolved Hide resolved
@VioletM VioletM force-pushed the feat/1603-prevent-files-removal-on-staging branch from 14b74ec to 46e4117 Compare August 27, 2024 14:11
@VioletM
Copy link
Contributor Author

VioletM commented Aug 27, 2024

@rudolfix

we need a proper test (look at test_stage_loading.py):

Added a test which runs on all staging configurations, taking about 10min, which is quite a lot. I can reduce the amount of all combinations between destinations and bucket types. It's nice that we're testing many combinations, but not necessary. Let me know if you think the test needs to be changed or even if you prefer to do it yourself.

  1. make sure that staging destination is not truncated (flag at default) and truncated (flag to false). at the same time make sure that destination tables are NOT truncated

I think we should left default behavior as is -- truncated (flag to True by default). Otherwise, with the update users start to see increase in the staging usage

@VioletM VioletM requested review from sh-rp and rudolfix August 27, 2024 14:22
rudolfix
rudolfix previously approved these changes Aug 27, 2024
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very good! thanks for the tests and docs update. your staging test is not marked as essential (OK!) so it won't run on a few destinations. nevertheless we can merge it and fix any remaining tests just before release

(I expect that athena non iceberg will fail here)

@VioletM
Copy link
Contributor Author

VioletM commented Aug 28, 2024

(I expect that athena non iceberg will fail here)

Hm, but this test uses destinations_configs(all_staging_configs=True) and there is no athena config present there

@@ -96,4 +100,21 @@ In essence, you need to set up two destinations and then pass them to `dlt.pipel

Run the pipeline script as usual.

> 💡 Please note that `dlt` does not delete loaded files from the staging storage after the load is complete.
:::tip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd run this whole new section through chatgpt to grammar correct (or run the grammar checker on this page)

# check there are two staging files
_, staging_client = pipeline._get_destination_clients(pipeline.default_schema)
with staging_client:
assert len(staging_client.list_table_files(table_name)) == 2 # type: ignore[attr-defined]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for later: we should probably allow fs_client on the pipeline to also return the staging filesystem client with a flag.

@@ -503,6 +503,9 @@ def _should_autodetect_schema(self, table_name: str) -> bool:
self.schema._schema_tables, table_name, AUTODETECT_SCHEMA_HINT, allow_none=True
) or (self.config.autodetect_schema and table_name not in self.schema.dlt_table_names())

def should_truncate_table_before_load_on_staging_destination(self, table: TTableSchema) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imho having the same exact line in all implementations is not DRY. We can keep it like this for now, but I would rather implement this in the superclass, probably with hasattr and isinstance to get the config and verify the type.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you see the previous version? at least here this is simple and mypy forces you to do overrides. previously you had to setup cooperative calling of super() and keep state in the class that should be an interface/trait. also if you forgot about that, the config would have no effect and you'll never know about it

"destination_config", destinations_configs(all_staging_configs=True), ids=lambda x: x.name
)
def test_truncate_staging_dataset(destination_config: DestinationTestConfiguration) -> None:
"""This test checks if tables truncation on staging destination done according to the configuration.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test is good, but to make it great we should also test wether keeping the staging files around will make the data be loaded again although it shouldn't. but for now i'd say it's good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, should be easy to do

sh-rp
sh-rp previously approved these changes Aug 28, 2024
Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please run the grammar checker on the docs and then you can merge.

@VioletM VioletM dismissed stale reviews from sh-rp and rudolfix via e65f853 August 28, 2024 08:53
@rudolfix rudolfix merged commit 98ca505 into devel Aug 28, 2024
55 of 56 checks passed
@rudolfix rudolfix deleted the feat/1603-prevent-files-removal-on-staging branch August 28, 2024 10:28
willi-mueller pushed a commit that referenced this pull request Sep 2, 2024
* Expose staging tables truncation to config

* Fix comments, add tests

* Fix tests

* Move implementation from mixing, add tests

* Fix docs grammar
willi-mueller pushed a commit that referenced this pull request Sep 2, 2024
* Expose staging tables truncation to config

* Fix comments, add tests

* Fix tests

* Move implementation from mixing, add tests

* Fix docs grammar
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Toggle to Prevent Deletion of Files Loaded into the Data Warehouse
3 participants