
iceberg table format support for filesystem destination #2067

Merged: 79 commits into devel from feat/1996-iceberg-filesystem on Dec 11, 2024

Conversation

jorritsandbrink
Collaborator

closes #1996

- a mypy upgrade was needed to solve this issue: apache/iceberg-python#768
- mypy is pinned to <1.13.0 because 1.13.0 gives an error
- new lint errors arising from the version upgrade are simply ignored
@jorritsandbrink jorritsandbrink linked an issue Nov 15, 2024 that may be closed by this pull request

netlify bot commented Nov 15, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 87553a6
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6736ff9d08a5540009fbc3a7


netlify bot commented Nov 15, 2024

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit accb62d
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6758dde83570910008ce95d6
😎 Deploy Preview https://deploy-preview-2067--dlt-hub-docs.netlify.app

@jorritsandbrink
Collaborator Author

jorritsandbrink commented Nov 16, 2024

@rudolfix / @sh-rp this PR isn't finished yet, but can you give an intermediate review?

What still needs to happen:

  • Azure/GCS support
  • partitioning
  • add iceberg to more of the existing delta tests
  • probably some other things

Notes:

  • I suggest we try adding partitioning in a follow-up PR. I looked at it but couldn't easily implement it, due to:
    1. a confusing API
    2. lack of documentation
    3. incorrect documentation
    4. bugs (or I am doing something wrong, which is not unlikely given 1/2/3)
  • The note above (struck through on GitHub) is outdated: partition support is included in this PR.
  • I ran into some issues with dependency management; see pyproject.toml.

Collaborator

@rudolfix rudolfix left a comment


IMO this is almost good.

  • we need to decide if we want to save a per-table catalog.
  • I'd enable the iceberg scanner in duckdb/sql_client in filesystem, mostly to start running the same tests that use delta.
  • maybe do the refactor of get_catalog I mentioned.

tests/load/pipeline/test_filesystem_pipeline.py (outdated, resolved)
dlt/common/libs/pyiceberg.py (outdated, resolved)
"""Returns single-table, ephemeral, in-memory Iceberg catalog."""

# create in-memory catalog
catalog = SqlCatalog(
Collaborator


NOTE: how we get a catalog should be some kind of plugin. so we can easily plug glue or rest to filesystem
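The plugin idea suggested above could be sketched as a simple factory registry. This is a hypothetical illustration only; the names (`register_catalog_factory`, `get_catalog`, the `"memory"` kind) are not dlt's or pyiceberg's actual API:

```python
# Hypothetical sketch of a pluggable catalog factory registry, so that
# glue or rest catalogs could be plugged into the filesystem destination.
# Names and mechanism are illustrative, not taken from dlt.
from typing import Any, Callable, Dict

_CATALOG_FACTORIES: Dict[str, Callable[..., Any]] = {}


def register_catalog_factory(name: str) -> Callable:
    """Decorator registering a catalog factory under a short name."""
    def decorator(factory: Callable[..., Any]) -> Callable[..., Any]:
        _CATALOG_FACTORIES[name] = factory
        return factory
    return decorator


def get_catalog(kind: str, **props: Any) -> Any:
    """Look up the registered factory for `kind` and invoke it."""
    try:
        factory = _CATALOG_FACTORIES[kind]
    except KeyError:
        raise ValueError(f"no catalog plugin registered for {kind!r}")
    return factory(**props)


@register_catalog_factory("memory")
def _memory_catalog(**props: Any) -> dict:
    # stand-in for an ephemeral in-memory catalog implementation
    return {"kind": "memory", **props}
```

With this shape, a glue or rest backend would only need to register its own factory; callers keep using `get_catalog(kind, ...)` unchanged.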

dlt/common/libs/pyiceberg.py (resolved)
dlt/common/libs/pyiceberg.py (resolved)
Collaborator

@rudolfix rudolfix left a comment


code is good and ready to merge. we also have sufficient tests. we still need somewhat better docs.

  1. pyiceberg.io:__init__.py:348 "Defaulting to PyArrow FileIO": maybe we could hide this log? we set log levels during tests in pytest_configure
  2. there are some tests that do not work in https://github.com/dlt-hub/dlt/actions/runs/12096164907/job/33730150792?pr=2067#step:8:4106 pls take a look (they are coming from somewhere else)
  3. we need help with improving the docs. my take would be to create a separate Delta / Iceberg destination page where we could move most of the docs out of the filesystem page. WDYT?
  • for both table formats you could write shortly how they are implemented
  • in the case of iceberg we should also describe how we write tables without a catalog and mention the limitations (single writer!)
  • we have a page where we enumerate table formats. I'll write some kind of introduction there next week
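Hiding that pyiceberg log line in test runs could be done with the standard `logging` module by raising the `pyiceberg.io` logger's level inside `pytest_configure`. A minimal sketch, assuming the message comes from the `pyiceberg.io` logger; dlt's actual test setup may configure levels differently:

```python
import logging


def pytest_configure(config):
    # Silence pyiceberg's INFO-level messages (e.g. the
    # "Defaulting to PyArrow FileIO" line) during test runs by
    # only letting WARNING and above through.
    logging.getLogger("pyiceberg.io").setLevel(logging.WARNING)
```

Because logger levels are hierarchical, this also quiets child loggers under `pyiceberg.io` without touching the root logger.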

elif schema_table.get("table_format") == "iceberg":
from dlt.common.libs.pyiceberg import _get_last_metadata_file

self._setup_iceberg(self._conn)
Collaborator


isn't the latest version of duckdb iceberg working with it?

Collaborator Author


Both issues mentioned in _setup_iceberg are still open. I just tried with duckdb==1.1.3, and it fails without _setup_iceberg.

@jorritsandbrink
Collaborator Author

@rudolfix I've addressed your feedback. Please have a look.

The only thing I don't understand is these failing tests:

FAILED tests/load/filesystem/test_credentials_mixins.py::test_gcp_credentials_mixins[WithObjectStoreRsCredentials-gs] - dlt.common.configuration.exceptions.ConfigValueCannotBeCoercedException: Configured value for field scopes cannot be coerced into type <class 'list'>
FAILED tests/load/filesystem/test_credentials_mixins.py::test_gcp_credentials_mixins[WithPyicebergConfig-gs] - dlt.common.configuration.exceptions.ConfigValueCannotBeCoercedException: Configured value for field scopes cannot be coerced into type <class 'list'>
FAILED tests/load/filesystem/test_gcs_credentials.py::test_explicit_filesystem_credentials - dlt.common.configuration.exceptions.ConfigValueCannotBeCoercedException: Configured value for field scopes cannot be coerced into type <class 'list'>

I can't reproduce them; everything works on my machine.

It's strange because, looking at the logs, it appears deserialize_value is called with key = 'scopes', value = 'None', hint = <class 'list'>. Logically, this shouldn't happen, because that function is only called if value is not None.

def _resolve_config_field(
   ...
        # if value is resolved, then deserialize and coerce it
        if value is not None:
            # do not deserialize explicit values
            if value is not explicit_value:
                value = deserialize_value(key, value, inner_hint)

I'm probably overlooking something. Do you know what's going on?
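For illustration, the suspected failure mode can be reproduced with a hypothetical stand-in for dlt's deserialize_value (this is not the real implementation): the string 'None' is not the None object, so it passes the `is not None` guard and then fails to coerce into a list.

```python
from typing import Any, Type


def deserialize_value(key: str, value: Any, hint: Type) -> Any:
    # Hypothetical stand-in for dlt's coercion step: a value destined
    # for a list-typed field must actually be a list; the literal
    # string "None" is not, so coercion fails.
    if hint is list:
        if isinstance(value, list):
            return value
        raise ValueError(
            f"Configured value for field {key} cannot be coerced into type {hint}"
        )
    return value


# The string "None" (not the None object) slips past the guard:
value = "None"
if value is not None:  # True: "None" is a non-empty string
    try:
        deserialize_value("scopes", value, list)
    except ValueError as exc:
        print(exc)
```

This would match the observed logs if, somewhere upstream, an unset `scopes` field were serialized to the string "None" instead of staying the None object.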

@rudolfix
Collaborator

rudolfix commented Dec 2, 2024

@jorritsandbrink that got fixed on another branch yesterday. so we are good. for some reason docs linting is not passing. pls check it out

@jorritsandbrink
Collaborator Author

> @jorritsandbrink that got fixed on another branch yesterday. so we are good. for some reason docs linting is not passing. pls check it out

@rudolfix fixed it by deleting chess_games folder in S3.

Error was:

<class 'dlt.common.schema.exceptions.SchemaEngineNoUpgradePathException'>
E           In schema: chess_com_source: No engine upgrade path in schema chess_com_source from 11 to 10, stopped at 11. You possibly tried to run an older dlt version against a destination you have previously loaded data to with a newer dlt version.

on _sync_destination in docs/examples/partial_loading/test_partial_loading.py::test_partial_loading.

Don't think that error came from this branch.

@rudolfix
Collaborator

rudolfix commented Dec 3, 2024

@jorritsandbrink now it all LGTM! just here:
https://github.com/dlt-hub/dlt/actions/runs/12125961933/job/33807050441?pr=2067#step:12:9511

I think we are importing iceberg too aggressively into filesystem. we should allow it to work without either delta or iceberg installed. please check it out!
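The guarded, lazy import that this suggests could look like the sketch below. The helper name and error message are hypothetical, not dlt's actual code; the point is that the optional dependency is only imported when the table format is actually used:

```python
import importlib
from types import ModuleType


def import_table_format_dependency(module_name: str, extra: str) -> ModuleType:
    """Import an optional table-format dependency lazily.

    Raises a RuntimeError with an install hint instead of an opaque
    ModuleNotFoundError when the dependency is missing, so the plain
    filesystem destination keeps working without it.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        raise RuntimeError(
            f"table format support requires the {extra!r} extra"
            f" ({module_name} is not installed)"
        ) from e
```

Calling this only inside the iceberg/delta code paths (rather than importing pyiceberg or deltalake at module import time) lets the destination load with neither library present.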

@jorritsandbrink
Collaborator Author

@rudolfix I don't think we're importing pyiceberg too aggressively. Those errors arose because pyiceberg was accidentally removed from pyproject.toml when merging the latest version of devel. It has already been fixed and the latest checks don't contain those errors.

@rudolfix rudolfix merged commit 4e5a240 into devel Dec 11, 2024
57 of 59 checks passed
@rudolfix rudolfix deleted the feat/1996-iceberg-filesystem branch December 11, 2024 08:36
donotpush pushed a commit that referenced this pull request Dec 11, 2024
* add pyiceberg dependency and upgrade mypy

- mypy upgrade needed to solve this issue: apache/iceberg-python#768
- uses <1.13.0 requirement on mypy because 1.13.0 gives error
- new lint errors arising due to version upgrade are simply ignored

* extend pyiceberg dependencies

* remove redundant delta annotation

* add basic local filesystem iceberg support

* add active table format setting

* disable merge tests for iceberg table format

* restore non-redundant extra info

* refactor to in-memory iceberg catalog

* add s3 support for iceberg table format

* add schema evolution support for iceberg table format

* extract _register_table function

* add partition support for iceberg table format

* update docstring

* enable child table test for iceberg table format

* enable empty source test for iceberg table format

* make iceberg catalog namespace configurable and default to dataset name

* add optional typing

* fix typo

* improve typing

* extract logic into dedicated function

* add iceberg read support to filesystem sql client

* remove unused import

* add todo

* extract logic into separate functions

* add azure support for iceberg table format

* generalize delta table format tests

* enable get tables function test for iceberg table format

* remove ignores

* undo table directory management change

* enable test_read_interfaces tests for iceberg

* fix active table format filter

* use mixin for object store rs credentials

* generalize catalog typing

* extract pyiceberg scheme mapping into separate function

* generalize credentials mixin test setup

* remove unused import

* add centralized fallback to append when merge is not supported

* Revert "add centralized fallback to append when merge is not supported"

This reverts commit 54cd0bc.

* fall back to append if merge is not supported on filesystem

* fix test for s3-compatible storage

* remove obsolete code path

* exclude gcs read interface tests for iceberg

* add gcs support for iceberg table format

* switch to UnsupportedAuthenticationMethodException

* add iceberg table format docs

* use shorter pipeline name to prevent too long sql identifiers

* add iceberg catalog note to docs

* black format

* use shorter pipeline name to prevent too long sql identifiers

* correct max id length for sqlalchemy mysql dialect

* Revert "use shorter pipeline name to prevent too long sql identifiers"

This reverts commit 6cce03b.

* Revert "use shorter pipeline name to prevent too long sql identifiers"

This reverts commit ef29aa7.

* replace show with execute to prevent useless print output

* add abfss scheme to test

* remove az support for iceberg table format

* remove iceberg bucket test exclusion

* add note to docs on azure scheme support for iceberg table format

* exclude iceberg from duckdb s3-compatibility test

* disable pyiceberg info logs for tests

* extend table format docs and move into own page

* upgrade adlfs to enable account_host attribute

* Merge branch 'devel' of https://github.com/dlt-hub/dlt into feat/1996-iceberg-filesystem

* fix lint errors

* re-add pyiceberg dependency

* enabled iceberg in dbt-duckdb

* upgrade pyiceberg version

* remove pyiceberg mypy errors across python version

* does not install airflow group for dev

* fixes gcp oauth iceberg credentials handling

* fixes ca cert bundle duckdb azure on ci

* allow for airflow dep to be present during type check

---------

Co-authored-by: Marcin Rudolf <[email protected]>
Development

Successfully merging this pull request may close these issues.

write iceberg tables on filesystem destination