Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise/warn on incomplete columns in normalize #1504

Merged
merged 6 commits into from
Aug 9, 2024

Conversation

steinitzu
Copy link
Collaborator

Description

Turns the "unbound column" warning into an exception for not-null columns and move it to normalize

Related Issues

Additional Context

Copy link

netlify bot commented Jun 21, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit c1e2c85
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66b3a4b0f38083000809eb94

@sh-rp
Copy link
Collaborator

sh-rp commented Jun 24, 2024

I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be that) should raise if not populated. Extraction spends time on I/O mostly and not on python code as the normalizer, so the check would not make a big difference in performance in my opinion.

@steinitzu
Copy link
Collaborator Author

I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be that) should raise if not populated. Extraction spends time on I/O mostly and not on python code as the normalizer, so the check would not make a big difference in performance in my opinion.

I agree it would be much better to fail early if possible. Ideally we could tell right after the first data item is extracted.
But I wasn't sure if we can always tell whether the column is populated in extract. The "seen data" marker for the table is only set in normalize so I was going by that. But I'll give it a try.

@@ -989,3 +989,24 @@ def r():
with pytest.raises(PipelineStepFailed) as pip_ex:
p.run(r())
assert isinstance(pip_ex.value.__context__, SchemaException)


@pytest.mark.parametrize(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write a test (or check if one exists) to see what happens when we do a merge on merge keys but some rows have null in the merge key? It's not super important right now, but if it would be interesting to know what happens :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a test so I added one. This was raising an exception already through schema.coerce_row in normalize

Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, small requests

@sh-rp sh-rp added sprint Marks group of tasks with core team focus at this moment labels Jun 26, 2024
@steinitzu steinitzu force-pushed the fix/error-missing-merge-key branch 2 times, most recently from d16217f to a855f32 Compare June 26, 2024 23:49
@steinitzu
Copy link
Collaborator Author

I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be that) should raise if not populated. Extraction spends time on I/O mostly and not on python code as the normalizer, so the check would not make a big difference in performance in my opinion.

Was looking into if this was possible also, but I don't think so without moving a lot of normalize logic into extract. I wasn't sure how much schema inferrence is done in extract, seems there is none.

@steinitzu steinitzu force-pushed the fix/error-missing-merge-key branch from 1db75e1 to 0d0afa5 Compare June 27, 2024 18:47
@steinitzu steinitzu marked this pull request as ready for review June 27, 2024 22:06
@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024
@rudolfix
Copy link
Collaborator

rudolfix commented Aug 6, 2024

@steinitzu this needs merge from devel. we did a lot of updates. @sh-rp otherwise this PR is good to go?

@steinitzu steinitzu force-pushed the fix/error-missing-merge-key branch from 7f36f97 to c1e2c85 Compare August 7, 2024 16:45
@steinitzu
Copy link
Collaborator Author

@steinitzu this needs merge from devel. we did a lot of updates. @sh-rp otherwise this PR is good to go?

branch is up to date now and tests passing

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@rudolfix rudolfix merged commit 61ab997 into devel Aug 9, 2024
53 of 54 checks passed
@rudolfix rudolfix deleted the fix/error-missing-merge-key branch August 9, 2024 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong Merge Key Not Throwing Error
3 participants