
iceberg table format support for filesystem destination #2067

Merged: 79 commits into devel from feat/1996-iceberg-filesystem on Dec 11, 2024

Conversation

jorritsandbrink
Collaborator

closes #1996

- a mypy upgrade was needed to solve this issue: apache/iceberg-python#768
- mypy is pinned to <1.13.0 because 1.13.0 gives an error
- new lint errors arising from the version upgrade are simply ignored
@jorritsandbrink jorritsandbrink linked an issue Nov 15, 2024 that may be closed by this pull request

netlify bot commented Nov 15, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 87553a6
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6736ff9d08a5540009fbc3a7


netlify bot commented Nov 15, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit accb62d
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6758dde83570910008ce95d6
😎 Deploy Preview https://deploy-preview-2067--dlt-hub-docs.netlify.app

@jorritsandbrink
Collaborator Author

jorritsandbrink commented Nov 16, 2024

@rudolfix / @sh-rp this PR isn't finished yet, but can you give an intermediate review?

What still needs to happen:

  • Azure/GCS support
  • partitioning
  • add iceberg to more of the existing delta tests
  • probably some other things

Notes:

  • I suggest we try adding partitioning in a follow-up PR. I looked at it but couldn't easily implement it, due to:
    1. a confusing API
    2. lack of documentation
    3. incorrect documentation
    4. bugs (or I am doing something wrong, which is not unlikely given 1/2/3)
  • The note above (struck through on GitHub) is outdated: partition support is included in this PR.
  • I ran into some issues with dependency management; see pyproject.toml.

Collaborator

@rudolfix rudolfix left a comment


IMO this is almost good.

  • we need to decide if we want to save a per-table catalog.
  • I'd enable the iceberg scanner in duckdb/sql_client in filesystem, mostly to start running the same tests that use delta.
  • maybe do the refactor of get_catalog I mentioned.

tests/load/pipeline/test_filesystem_pipeline.py (outdated, resolved)
dlt/common/libs/pyiceberg.py (outdated, resolved)
"""Returns single-table, ephemeral, in-memory Iceberg catalog."""

# create in-memory catalog
catalog = SqlCatalog(
Collaborator


NOTE: how we get a catalog should be some kind of plugin. so we can easily plug glue or rest to filesystem
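The plugin idea suggested above could be sketched as a simple factory registry. This is a hypothetical illustration only; the names (`register_catalog_factory`, `get_catalog`, the `"memory"` kind) are not dlt's or pyiceberg's actual API:

```python
# Hypothetical sketch of a pluggable catalog factory registry, so that
# glue or rest catalogs could be plugged into the filesystem destination.
# Names and mechanism are illustrative, not taken from dlt.
from typing import Any, Callable, Dict

_CATALOG_FACTORIES: Dict[str, Callable[..., Any]] = {}


def register_catalog_factory(name: str) -> Callable:
    """Decorator registering a catalog factory under a short name."""
    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        _CATALOG_FACTORIES[name] = factory
        return factory
    return decorator


def get_catalog(kind: str, **props: Any) -> Any:
    """Look up the registered factory for `kind` and invoke it."""
    try:
        factory = _CATALOG_FACTORIES[kind]
    except KeyError:
        raise ValueError(f"no catalog plugin registered for {kind!r}")
    return factory(**props)


@register_catalog_factory("memory")
def _memory_catalog(**props: Any) -> dict:
    # stand-in for an ephemeral in-memory catalog implementation
    return {"kind": "memory", **props}
```

With this shape, a glue or rest backend would only need to register its own factory; callers keep using `get_catalog(kind, ...)` unchanged.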

dlt/common/libs/pyiceberg.py (resolved)
dlt/common/libs/pyiceberg.py (resolved)
Collaborator

@rudolfix rudolfix left a comment


code is good and ready to merge. we also have sufficient tests. we still need somewhat better docs.

  1. pyiceberg.io:__init__.py:348 "Defaulting to PyArrow FileIO": maybe we could hide this log? we set log levels during tests in pytest_configure
  2. there are some tests that do not work in https://github.com/dlt-hub/dlt/actions/runs/12096164907/job/33730150792?pr=2067#step:8:4106 pls take a look (they are coming from somewhere else)
  3. we need help with improving the docs. my take would be to create a separate Delta / Iceberg destination page where we could move most of the docs out of the filesystem page. WDYT?
  • for both table formats you could write shortly how they are implemented
  • in the case of iceberg we should also describe how we write tables without a catalog and mention the limitations (single writer!)
  • we have a page where we enumerate table formats. I'll write some kind of introduction there next week
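Hiding that pyiceberg log line in test runs could be done with the standard `logging` module by raising the `pyiceberg.io` logger's level inside `pytest_configure`. A minimal sketch, assuming the message comes from the `pyiceberg.io` logger; dlt's actual test setup may configure levels differently:

```python
import logging


def pytest_configure(config):
    # Silence pyiceberg's INFO-level messages (e.g. the
    # "Defaulting to PyArrow FileIO" line) during test runs by
    # only letting WARNING and above through.
    logging.getLogger("pyiceberg.io").setLevel(logging.WARNING)
```

Because logger levels are hierarchical, this also quiets child loggers under `pyiceberg.io` without touching the root logger.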

elif schema_table.get("table_format") == "iceberg":
from dlt.common.libs.pyiceberg import _get_last_metadata_file

self._setup_iceberg(self._conn)
Collaborator


isn't the latest version of duckdb iceberg working with it?

Collaborator Author


Both issues mentioned in _setup_iceberg are still open. I just tried with duckdb==1.1.3, and it fails without _setup_iceberg.

@jorritsandbrink
Collaborator Author

@rudolfix I've addressed your feedback. Please have a look.

The only thing I don't understand is these failing tests:

FAILED tests/load/filesystem/test_credentials_mixins.py::test_gcp_credentials_mixins[WithObjectStoreRsCredentials-gs] - dlt.common.configuration.exceptions.ConfigValueCannotBeCoercedException: Configured value for field scopes cannot be coerced into type <class 'list'>
FAILED tests/load/filesystem/test_credentials_mixins.py::test_gcp_credentials_mixins[WithPyicebergConfig-gs] - dlt.common.configuration.exceptions.ConfigValueCannotBeCoercedException: Configured value for field scopes cannot be coerced into type <class 'list'>
FAILED tests/load/filesystem/test_gcs_credentials.py::test_explicit_filesystem_credentials - dlt.common.configuration.exceptions.ConfigValueCannotBeCoercedException: Configured value for field scopes cannot be coerced into type <class 'list'>

I can't reproduce them; everything works on my machine.

It's strange because, looking at the logs, it appears deserialize_value is called with key = 'scopes', value = 'None', hint = <class 'list'>. Logically, this shouldn't happen, because that function is only called if value is not None.

def _resolve_config_field(
   ...
        # if value is resolved, then deserialize and coerce it
        if value is not None:
            # do not deserialize explicit values
            if value is not explicit_value:
                value = deserialize_value(key, value, inner_hint)

I'm probably overlooking something. Do you know what's going on?
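For illustration, the suspected failure mode can be reproduced with a hypothetical stand-in for dlt's deserialize_value (this is not the real implementation): the string 'None' is not the None object, so it passes the `is not None` guard and then fails to coerce into a list.

```python
from typing import Any, Type


def deserialize_value(key: str, value: Any, hint: Type) -> Any:
    # Hypothetical stand-in for dlt's coercion step: a value destined
    # for a list-typed field must actually be a list; the literal
    # string "None" is not, so coercion fails.
    if hint is list:
        if isinstance(value, list):
            return value
        raise ValueError(
            f"Configured value for field {key} cannot be coerced into type {hint}"
        )
    return value


# The string "None" (not the None object) slips past the guard:
value = "None"
if value is not None:  # True: "None" is a non-empty string
    try:
        deserialize_value("scopes", value, list)
    except ValueError as exc:
        print(exc)
```

This would match the observed logs if, somewhere upstream, an unset `scopes` field were serialized to the string "None" instead of staying the None object.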

@rudolfix
Collaborator

rudolfix commented Dec 2, 2024

@jorritsandbrink that got fixed on another branch yesterday. so we are good. for some reason docs linting is not passing. pls check it out

@jorritsandbrink
Collaborator Author

> @jorritsandbrink that got fixed on another branch yesterday. so we are good. for some reason docs linting is not passing. pls check it out

@rudolfix fixed it by deleting chess_games folder in S3.

Error was:

<class 'dlt.common.schema.exceptions.SchemaEngineNoUpgradePathException'>
E           In schema: chess_com_source: No engine upgrade path in schema chess_com_source from 11 to 10, stopped at 11. You possibly tried to run an older dlt version against a destination you have previously loaded data to with a newer dlt version.

on _sync_destination in docs/examples/partial_loading/test_partial_loading.py::test_partial_loading.

Don't think that error came from this branch.

@rudolfix
Collaborator

rudolfix commented Dec 3, 2024

@jorritsandbrink now it all LGTM! just here:
https://github.com/dlt-hub/dlt/actions/runs/12125961933/job/33807050441?pr=2067#step:12:9511

I think we are importing iceberg too aggressively into filesystem. we should allow it to work without either delta or iceberg installed. please check it out!
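The guarded, lazy import that this suggests could look like the sketch below. The helper name and error message are hypothetical, not dlt's actual code; the point is that the optional dependency is only imported when the table format is actually used:

```python
import importlib
from types import ModuleType


def import_table_format_dependency(module_name: str, extra: str) -> ModuleType:
    """Import an optional table-format dependency lazily.

    Raises a RuntimeError with an install hint instead of an opaque
    ModuleNotFoundError when the dependency is missing, so the plain
    filesystem destination keeps working without it.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        raise RuntimeError(
            f"table format support requires the {extra!r} extra"
            f" ({module_name} is not installed)"
        ) from e
```

Calling this only inside the iceberg/delta code paths (rather than importing pyiceberg or deltalake at module import time) lets the destination load with neither library present.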

@jorritsandbrink
Collaborator Author

@rudolfix I don't think we're importing pyiceberg too aggressively. Those errors arose because pyiceberg was accidentally removed from pyproject.toml when merging the latest version of devel. It has already been fixed and the latest checks don't contain those errors.

@rudolfix rudolfix merged commit 4e5a240 into devel Dec 11, 2024
57 of 59 checks passed
@rudolfix rudolfix deleted the feat/1996-iceberg-filesystem branch December 11, 2024 08:36
donotpush pushed a commit that referenced this pull request Dec 11, 2024
* add pyiceberg dependency and upgrade mypy

- mypy upgrade needed to solve this issue: apache/iceberg-python#768
- uses <1.13.0 requirement on mypy because 1.13.0 gives error
- new lint errors arising due to version upgrade are simply ignored

* extend pyiceberg dependencies

* remove redundant delta annotation

* add basic local filesystem iceberg support

* add active table format setting

* disable merge tests for iceberg table format

* restore non-redundant extra info

* refactor to in-memory iceberg catalog

* add s3 support for iceberg table format

* add schema evolution support for iceberg table format

* extract _register_table function

* add partition support for iceberg table format

* update docstring

* enable child table test for iceberg table format

* enable empty source test for iceberg table format

* make iceberg catalog namespace configurable and default to dataset name

* add optional typing

* fix typo

* improve typing

* extract logic into dedicated function

* add iceberg read support to filesystem sql client

* remove unused import

* add todo

* extract logic into separate functions

* add azure support for iceberg table format

* generalize delta table format tests

* enable get tables function test for iceberg table format

* remove ignores

* undo table directory management change

* enable test_read_interfaces tests for iceberg

* fix active table format filter

* use mixin for object store rs credentials

* generalize catalog typing

* extract pyiceberg scheme mapping into separate function

* generalize credentials mixin test setup

* remove unused import

* add centralized fallback to append when merge is not supported

* Revert "add centralized fallback to append when merge is not supported"

This reverts commit 54cd0bc.

* fall back to append if merge is not supported on filesystem

* fix test for s3-compatible storage

* remove obsolete code path

* exclude gcs read interface tests for iceberg

* add gcs support for iceberg table format

* switch to UnsupportedAuthenticationMethodException

* add iceberg table format docs

* use shorter pipeline name to prevent too long sql identifiers

* add iceberg catalog note to docs

* black format

* use shorter pipeline name to prevent too long sql identifiers

* correct max id length for sqlalchemy mysql dialect

* Revert "use shorter pipeline name to prevent too long sql identifiers"

This reverts commit 6cce03b.

* Revert "use shorter pipeline name to prevent too long sql identifiers"

This reverts commit ef29aa7.

* replace show with execute to prevent useless print output

* add abfss scheme to test

* remove az support for iceberg table format

* remove iceberg bucket test exclusion

* add note to docs on azure scheme support for iceberg table format

* exclude iceberg from duckdb s3-compatibility test

* disable pyiceberg info logs for tests

* extend table format docs and move into own page

* upgrade adlfs to enable account_host attribute

* Merge branch 'devel' of https://github.com/dlt-hub/dlt into feat/1996-iceberg-filesystem

* fix lint errors

* re-add pyiceberg dependency

* enabled iceberg in dbt-duckdb

* upgrade pyiceberg version

* remove pyiceberg mypy errors across python version

* does not install airflow group for dev

* fixes gcp oauth iceberg credentials handling

* fixes ca cert bundle duckdb azure on ci

* allow for airflow dep to be present during type check

---------

Co-authored-by: Marcin Rudolf <[email protected]>
Development

Successfully merging this pull request may close these issues.

write iceberg tables on filesystem destination