Raise/warn on incomplete columns in normalize #1504

steinitzu · 2024-06-21T01:40:21Z

Description

Turns the "unbound column" warning into an exception for not-null columns and moves it to normalize.
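A minimal sketch of the behavior described above, not the code from this PR. It assumes a column counts as incomplete when its schema entry has no `data_type`, and that key hints mark a column `nullable: False`. `IncompleteColumnError` and `check_incomplete_columns` are hypothetical names used only for illustration.

```python
import warnings
from typing import Any, Dict, List


class IncompleteColumnError(Exception):
    """Hypothetical stand-in for the schema exception raised in normalize."""


def check_incomplete_columns(
    table_name: str, columns: Dict[str, Dict[str, Any]]
) -> List[str]:
    """Raise for incomplete non-nullable columns, warn for nullable ones.

    Assumption: a column is incomplete when it has no "data_type", meaning
    no data was ever bound to it. Returns the names that were warned about.
    """
    warned: List[str] = []
    for name, column in columns.items():
        if column.get("data_type") is not None:
            continue  # column was bound to data, nothing to check
        if column.get("nullable", True) is False:
            # Typical cause: a misspelled merge or primary key hint.
            raise IncompleteColumnError(
                f"Column '{name}' in table '{table_name}' is not nullable "
                "but never received any data."
            )
        warnings.warn(
            f"Column '{name}' in table '{table_name}' never received any data."
        )
        warned.append(name)
    return warned
```

For example, if the data carries `id` but the merge key was declared as `idd`, the `idd` column stays without a data type and is not nullable, so this check raises instead of only warning.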

Related Issues

Fixes Wrong Merge Key Not Throwing Error #1463

Additional Context

netlify · 2024-06-21T01:40:38Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`c1e2c85`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/66b3a4b0f38083000809eb94

sh-rp · 2024-06-24T11:54:54Z

I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be) should raise if not populated. Extraction spends most of its time on I/O rather than on Python code, unlike the normalizer, so the check would not make a big difference in performance, in my opinion.

steinitzu · 2024-06-24T13:55:51Z

> I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be) should raise if not populated. Extraction spends most of its time on I/O rather than on Python code, unlike the normalizer, so the check would not make a big difference in performance, in my opinion.

I agree it would be much better to fail early if possible. Ideally we could tell right after the first data item is extracted.
But I wasn't sure if we can always tell whether the column is populated in extract. The "seen data" marker for the table is only set in normalize so I was going by that. But I'll give it a try.

dlt/common/schema/exceptions.py

sh-rp · 2024-06-26T14:04:32Z

tests/load/pipeline/test_merge_disposition.py

@@ -989,3 +989,24 @@ def r():
    with pytest.raises(PipelineStepFailed) as pip_ex:
        p.run(r())
    assert isinstance(pip_ex.value.__context__, SchemaException)
+
+
+@pytest.mark.parametrize(


Can you write a test (or check if one exists) to see what happens when we do a merge on merge keys but some rows have null in the merge key? It's not super important right now, but it would be interesting to know what happens :)

I couldn't find a test so I added one. This was already raising an exception through schema.coerce_row in normalize.
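A rough sketch of the kind of check discussed here, assuming a row is validated value by value against the table's column hints. This is not dlt's `coerce_row`; `NullInNonNullableColumnError` and `check_row_nulls` are hypothetical names.

```python
from typing import Any, Dict


class NullInNonNullableColumnError(ValueError):
    """Hypothetical error; not the exception dlt raises."""


def check_row_nulls(
    table_name: str, row: Dict[str, Any], columns: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Reject a row that carries None in a column marked as not nullable.

    Columns without a schema entry are treated as nullable (assumption).
    """
    for name, value in row.items():
        column = columns.get(name, {})
        if value is None and column.get("nullable", True) is False:
            raise NullInNonNullableColumnError(
                f"Null value in non-nullable column '{name}' "
                f"of table '{table_name}'."
            )
    return row
```

With a merge key column marked `nullable: False`, a row such as `{"id": None, "name": "a"}` would be rejected before any merge takes place.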

sh-rp

Looks good, small requests

steinitzu · 2024-06-26T23:56:11Z

> I'm wondering if we should not do this in the extraction step already. All columns that are non-nullable (and merge and primary keys should be) should raise if not populated. Extraction spends most of its time on I/O rather than on Python code, unlike the normalizer, so the check would not make a big difference in performance, in my opinion.

I also looked into whether this is possible, but I don't think it is without moving a lot of normalize logic into extract. I wasn't sure how much schema inference is done in extract; it seems there is none.

rudolfix · 2024-08-06T11:14:16Z

@steinitzu this needs merge from devel. we did a lot of updates. @sh-rp otherwise this PR is good to go?

steinitzu · 2024-08-08T00:30:30Z

> @steinitzu this needs merge from devel. we did a lot of updates. @sh-rp otherwise this PR is good to go?

Branch is up to date now and tests are passing.

rudolfix

LGTM!

sh-rp reviewed Jun 26, 2024

dlt/common/schema/exceptions.py (outdated)

sh-rp reviewed Jun 26, 2024

sh-rp requested changes Jun 26, 2024

sh-rp added the sprint label (marks group of tasks with core team focus at this moment) Jun 26, 2024

steinitzu force-pushed the fix/error-missing-merge-key branch 2 times, most recently from d16217f to a855f32 on June 26, 2024 23:49

steinitzu force-pushed the fix/error-missing-merge-key branch from 1db75e1 to 0d0afa5 on June 27, 2024 18:47

steinitzu marked this pull request as ready for review June 27, 2024 22:06

rudolfix removed the sprint label (marks group of tasks with core team focus at this moment) Jul 3, 2024

steinitzu added 6 commits August 7, 2024 10:06

Raise/warn on incomplete columns in normalize

94f8a46

Raise on not-nullable columns to catch e.g. misspelled merge/primary key

Update error msg

9a24a1c

Test for null values

17effad

Lint

2545101

Delete now invalid tests

d610aff

Fix common test

c1e2c85

steinitzu force-pushed the fix/error-missing-merge-key branch from 7f36f97 to c1e2c85 on August 7, 2024 16:45

rudolfix approved these changes Aug 9, 2024

rudolfix merged commit 61ab997 into devel Aug 9, 2024
53 of 54 checks passed

rudolfix deleted the fix/error-missing-merge-key branch August 9, 2024 13:53
