-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value #1576
Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value #1576
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
09001ab
to
858ec47
Compare
dff327d
to
2e67119
Compare
2e67119
to
2588d7f
Compare
2588d7f
to
95db6c7
Compare
@willi-mueller this is an important feature, thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@willi-mueller incremental is a minefield ;> please read my comments. we can simplify code in json transform but there are edge cases in arrow. please fix what you can and we'll work on the tests
Thank you @VioletM. You raise an important question! Yes, each time the None values appears in the source it will be processed. My proposal would be to use What do you think about the proposed changes in docs/website/docs/general-usage/incremental-loading.md proposed here: #1650? How can we mention your point there too? |
…ow. Fails for pandas & pyarrow implements handling of cursor_path value is None for nested JSON paths
…alue and is_above_end_value
…al function is called
…ow_initial_value and is_above_end_value This reverts commit 733d1c0.
… Fails for arrow-batch
7feef42
to
ca7b91c
Compare
@rudolfix Currently, the following test is failing: The reason is that before this PR, the following exception was ignored: However, in this PR, I wanted to:
Do you think these two points are appropriate? Just to let the test suite pass, I restored the old behavior in 06aff12 passing any exception and using a run-time type check to decide whether to use keys() or Further, the test case is concerned with how dlt handles traces and uses the incremental only to produce a certain trace and it is not testing the incremental implementation. What would you advise? |
…::test_dump_trace_freeze_exception
2ddcd1b
to
06aff12
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of like it now! Please see the comments. thanks for all tests and docs!
589b592
to
0b267a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
… with cursor_path missing or None value (#1576) * allows specification of what happens on cursor_path missing or cursor_path having the value None: raise differentiated exceptions, exclude row, or include row. * Documents handling None values at the incremental cursor * fixes incremental extract crashing if one record has cursor_path = None * test that add_map can be used to transform items before the incremental function is called * Unifies treating of None values for python Objects (including pydantic), pandas, and arrow --------- Co-authored-by: Marcin Rudolf <[email protected]>
… with cursor_path missing or None value (#1576) * allows specification of what happens on cursor_path missing or cursor_path having the value None: raise differentiated exceptions, exclude row, or include row. * Documents handling None values at the incremental cursor * fixes incremental extract crashing if one record has cursor_path = None * test that add_map can be used to transform items before the incremental function is called * Unifies treating of None values for python Objects (including pydantic), pandas, and arrow --------- Co-authored-by: Marcin Rudolf <[email protected]>
Description
This PR:
Allows users to specify what happens when the value at the incremental cursor path is
None
or the field is missing in a row. It also unifies the handling of null values in pandas/arrow with python objects.Consider the following example data where
created_at
is the incremental cursor path:The options are:
incremental(..., on_cursor_value_missing="raise")
. This will raiseIncrementalCursorPathHasValueNone
fordata_1
andIncrementalCursorPathMissing
fordata_2
.incremental(..., on_cursor_value_missing="include")
. This will load all rows for bothdata_1
anddata_2
respectively.incremental(..., on_cursor_value_missing="exclude")
. This will load only the first row for bothdata_1
anddata_2
respectively.This PR also adds documentation on how to load data with
None
at the cursor path incrementallyAll outlined features are implemented and tested for all 4 data formats (object, pandas, arrow-table, arrow batch). However, JSON path cursors are still only supported for JSON objects but not for arrow and pandas.
Done in collaboration with @francescomucio
TODO
add_map()
to add default valuesRelated Issues