Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value #1576

Merged

Conversation

willi-mueller
Copy link
Collaborator

@willi-mueller willi-mueller commented Jul 10, 2024

Description

This PR:
Allows users to specify what happens when the value at the incremental cursor path is None or the field is missing in a row. It also unifies the handling of null values in pandas/arrow with python objects.

Consider the following example data where created_at is the incremental cursor path:

data_1 = [
  {"a": 1, "created_at": 1},
  {"a": 2, "created_at": None},
]

data_2 = [
  {"a": 1, "created_at": 1},
  {"a": 2},
]

The options are:

  1. incremental(..., on_cursor_value_missing="raise"). This will raise IncrementalCursorPathHasValueNone for data_1 and IncrementalCursorPathMissing for data_2.
  2. incremental(..., on_cursor_value_missing="include"). This will load all rows for both data_1 and data_2 respectively.
  3. incremental(..., on_cursor_value_missing="exclude"). This will load only the first row for both data_1 and data_2 respectively.

This PR also adds documentation on how to load data with None at the cursor path incrementally

All outlined features are implemented and tested for all 4 data formats (object, pandas, arrow-table, arrow batch). However, JSON path cursors are still only supported for JSON objects but not for arrow and pandas.

Done in collaboration with @francescomucio

TODO

  • docs: explain new parameter
  • docs: explain how we can leverage add_map() to add default values

Related Issues

Copy link

netlify bot commented Jul 10, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 4ce384a
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66d1b5e64a6fec0008236e69
😎 Deploy Preview https://deploy-preview-1576--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@willi-mueller willi-mueller force-pushed the fix/1571-tolerate-record-without-incremental-cursor-path branch from 09001ab to 858ec47 Compare July 12, 2024 12:57
@willi-mueller willi-mueller changed the title Fix/1571 tolerate record without incremental cursor path Fix/1571 tolerate record with None value at incremental cursor path Jul 12, 2024
@willi-mueller willi-mueller force-pushed the fix/1571-tolerate-record-without-incremental-cursor-path branch from dff327d to 2e67119 Compare July 22, 2024 13:52
@VioletM VioletM added the community This issue came from slack community workspace label Jul 25, 2024
@willi-mueller willi-mueller force-pushed the fix/1571-tolerate-record-without-incremental-cursor-path branch from 2e67119 to 2588d7f Compare July 25, 2024 08:59
@willi-mueller willi-mueller marked this pull request as ready for review July 25, 2024 09:00
@willi-mueller willi-mueller force-pushed the fix/1571-tolerate-record-without-incremental-cursor-path branch from 2588d7f to 95db6c7 Compare July 25, 2024 09:09
@willi-mueller willi-mueller requested review from rudolfix and sh-rp July 25, 2024 09:12
@willi-mueller willi-mueller changed the title Fix/1571 tolerate record with None value at incremental cursor path Fix/1571 Optionally load records with None value at incremental cursor path Jul 25, 2024
@VioletM
Copy link
Contributor

VioletM commented Jul 29, 2024

@willi-mueller this is an important feature, thank you!
I have a question: records with None value at the incremental cursor will be re-loaded each time, right? Could maybe some fall-back cursor be specified in this case?

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willi-mueller incremental is a minefield ;> please read my comments. we can simplify code in json transform but there are edge cases in arrow. please fix what you can and we'll work on the tests

dlt/extract/incremental/__init__.py Outdated Show resolved Hide resolved
dlt/extract/incremental/__init__.py Outdated Show resolved Hide resolved
dlt/extract/incremental/transform.py Outdated Show resolved Hide resolved
dlt/extract/incremental/transform.py Outdated Show resolved Hide resolved
dlt/extract/incremental/transform.py Outdated Show resolved Hide resolved
dlt/extract/incremental/typing.py Outdated Show resolved Hide resolved
tests/extract/test_incremental.py Show resolved Hide resolved
tests/extract/test_incremental.py Outdated Show resolved Hide resolved
dlt/extract/incremental/transform.py Outdated Show resolved Hide resolved
@willi-mueller
Copy link
Collaborator Author

willi-mueller commented Jul 30, 2024

@willi-mueller this is an important feature, thank you! I have a question: records with None value at the incremental cursor will be re-loaded each time, right? Could maybe some fall-back cursor be specified in this case?

Thank you @VioletM. You raise an important question! Yes, each time the None values appears in the source it will be processed. My proposal would be to use add_map() to specify the default fallback cursor.

What do you think about the proposed changes in docs/website/docs/general-usage/incremental-loading.md proposed here: #1650?

How can we mention your point there too?

@willi-mueller willi-mueller changed the title Fix/1571 Optionally load records with None value at incremental cursor path Fix/1571 Incremental: Optionally load records with None value at cursor_path Jul 31, 2024
@willi-mueller willi-mueller force-pushed the fix/1571-tolerate-record-without-incremental-cursor-path branch from 7feef42 to ca7b91c Compare August 7, 2024 14:06
@willi-mueller willi-mueller requested a review from rudolfix August 7, 2024 14:07
@willi-mueller willi-mueller self-assigned this Aug 8, 2024
@willi-mueller
Copy link
Collaborator Author

willi-mueller commented Aug 8, 2024

@rudolfix Currently, the following test is failing:
tests/pipeline/test_pipeline_extra.py::test_dump_trace_freeze_exception

The reason is that before this PR, the following exception was ignored: AttributeError: 'TestRow' object has no attribute 'get' when trying to access TestRow.get("id"). Instead, the IncrementalCursorPathMissing was raised.

However, in this PR, I wanted to:

  1. we do not ignore the exception when the record object does not implement __get__().
  2. assume that every row implements keys().

Do you think these two points are appropriate?

Just to let the test suite pass, I restored the old behavior in 06aff12 passing any exception and using a run-time type check to decide whether to use keys() or hasattr().

Further, the test case is concerned with how dlt handles traces and uses the incremental only to produce a certain trace and it is not testing the incremental implementation.

What would you advise?

@willi-mueller willi-mueller force-pushed the fix/1571-tolerate-record-without-incremental-cursor-path branch from 2ddcd1b to 06aff12 Compare August 8, 2024 17:33
@willi-mueller willi-mueller changed the title Fix/1571 Incremental: Optionally load records with None value at cursor_path Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value Aug 9, 2024
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of like it now! Please see the comments. thanks for all tests and docs!

dlt/extract/incremental/transform.py Outdated Show resolved Hide resolved
dlt/extract/incremental/transform.py Outdated Show resolved Hide resolved
dlt/extract/incremental/transform.py Outdated Show resolved Hide resolved
tests/extract/test_incremental.py Show resolved Hide resolved
@willi-mueller willi-mueller force-pushed the fix/1571-tolerate-record-without-incremental-cursor-path branch from 589b592 to 0b267a2 Compare August 20, 2024 12:37
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@willi-mueller willi-mueller merged commit 1723faa into devel Aug 30, 2024
56 checks passed
@willi-mueller willi-mueller deleted the fix/1571-tolerate-record-without-incremental-cursor-path branch August 30, 2024 17:31
@willi-mueller willi-mueller restored the fix/1571-tolerate-record-without-incremental-cursor-path branch August 30, 2024 17:32
@willi-mueller willi-mueller deleted the fix/1571-tolerate-record-without-incremental-cursor-path branch August 30, 2024 17:32
willi-mueller added a commit that referenced this pull request Sep 2, 2024
… with cursor_path missing or None value (#1576)

* allows specification of what happens on cursor_path missing or cursor_path having the value None: raise differentiated exceptions, exclude row, or include row.

* Documents handling None values at the incremental cursor

* fixes incremental extract crashing if one record has cursor_path = None

* test that add_map can be used to transform items before the incremental function is called

* Unifies treating of None values for python Objects (including pydantic), pandas, and arrow

---------

Co-authored-by: Marcin Rudolf <[email protected]>
willi-mueller added a commit that referenced this pull request Sep 2, 2024
… with cursor_path missing or None value (#1576)

* allows specification of what happens on cursor_path missing or cursor_path having the value None: raise differentiated exceptions, exclude row, or include row.

* Documents handling None values at the incremental cursor

* fixes incremental extract crashing if one record has cursor_path = None

* test that add_map can be used to transform items before the incremental function is called

* Unifies treating of None values for python Objects (including pydantic), pandas, and arrow

---------

Co-authored-by: Marcin Rudolf <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community This issue came from slack community workspace
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pipeline crashes when single record has None value at incremental cursor path
3 participants