Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues/29: Add version guessing functionality #79

Merged
merged 13 commits into from
Nov 26, 2024

Conversation

teaguesterling
Copy link
Contributor

@teaguesterling teaguesterling commented Nov 12, 2024

This adds a new (default) version option that can be passed to iceberg table functions.

When version="?", some simple logic will go into effect that does the following:

  • Checks for a version-hint.text file, using it if it exists.
  • If not, it will Glob the metadata directory using the version_format patterns, and select the lexicographically "largest" metadata filename (e.g., v00002-xxxxxxxxxxxxx.metadata.json will be picked over v00001-xxxxxxxxxxxxx.metadata.json ).

There's additional tests needed here and I'm running into a seemingly unrelated parquet bug when testing. There aslo some room for improvement on the logic of how versions are found and selected. However, at the very least, this works as we've all hoped from #29 :

D SELECT file_path FROM iceberg_metadata('lineitem_iceberg_no_hint');
┌────────────────────────────────────────────────────────────────────────────────────┐
│                                     file_path                                      │
│                                      varchar                                       │
├────────────────────────────────────────────────────────────────────────────────────┤
│ lineitem_iceberg/data/00041-414-f3c73457-bbd6-4b92-9c15-17b241171b16-00001.parquet │
│ lineitem_iceberg/data/00000-411-0792dcfe-4e25-4ca3-8ada-175286069a47-00001.parquet │
└────────────────────────────────────────────────────────────────────────────────────┘
D SELECT file_path FROM iceberg_metadata('lineitem_iceberg_no_hint', version='1');
┌────────────────────────────────────────────────────────────────────────────────────┐
│                                     file_path                                      │
│                                      varchar                                       │
├────────────────────────────────────────────────────────────────────────────────────┤
│ lineitem_iceberg/data/00000-411-0792dcfe-4e25-4ca3-8ada-175286069a47-00001.parquet │
└────────────────────────────────────────────────────────────────────────────────────┘
D SELECT file_path FROM iceberg_metadata('lineitem_iceberg_no_hint', version='2');
┌────────────────────────────────────────────────────────────────────────────────────┐
│                                     file_path                                      │
│                                      varchar                                       │
├────────────────────────────────────────────────────────────────────────────────────┤
│ lineitem_iceberg/data/00041-414-f3c73457-bbd6-4b92-9c15-17b241171b16-00001.parquet │
│ lineitem_iceberg/data/00000-411-0792dcfe-4e25-4ca3-8ada-175286069a47-00001.parquet │
└────────────────────────────────────────────────────────────────────────────────────┘
D .system ls lineitem_iceberg_no_hint/metadata
10eaca8a-1e1c-421e-ad6d-b232e5ee23d3-m0.avro
10eaca8a-1e1c-421e-ad6d-b232e5ee23d3-m1.avro
cf3d0be5-cf70-453d-ad8f-48fdc412e608-m0.avro
snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro
snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro
v1.metadata.json
v2.metadata.json

Signed-off-by: Teague Sterling <[email protected]>
Signed-off-by: Teague Sterling <[email protected]>
Signed-off-by: Teague Sterling <[email protected]>
@samansmink
Copy link
Collaborator

Thanks for the PR @teaguesterling, I think this makes sense! There do seem to be some tests failing related to this PR

@teaguesterling
Copy link
Contributor Author

Thanks for the PR @teaguesterling, I think this makes sense! There do seem to be some tests failing related to this PR

Thanks, @samansmink. I'm trying to debug the test failures. I added some new ones to cover the new functionality introduced here but was seeing failures in untouched parts of the code related to the explicit_cardinality parameter in parquet scanning. Removing those bits caused tests to not throw exceptions but instead return incorrect results, perhaps related to deletion handing. I've been working off the main branch. Is there a different branch to consider using for testing?

@teaguesterling
Copy link
Contributor Author

@samansmink It seems there was some weirdness with the test data and cases in my branch. I've cleaned those up and made some fixes to get this to pass.

src/common/iceberg.cpp Outdated Show resolved Hide resolved
src/common/iceberg.cpp Outdated Show resolved Hide resolved
@teaguesterling
Copy link
Contributor Author

teaguesterling commented Nov 18, 2024 via email

src/common/iceberg.cpp Outdated Show resolved Hide resolved
@teaguesterling
Copy link
Contributor Author

I will update the test cases to cover the new flag. Thank you for implementing that!

@teaguesterling
Copy link
Contributor Author

This PR addresses the review comments from @Tishj and @samansmink and updates the test cases to cover the new config variable.

However, it seems that the OSX build is failing before getting building or testing the extension.

@samansmink
Copy link
Collaborator

I've fixed the problem with CI and opened a PR to your branch here teaguesterling#1 (for some sort of reason i can't seem to push directly to this branch using the GH cli like i normally can)

Copy link
Collaborator

@samansmink samansmink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks a lot @teaguesterling

@samansmink samansmink merged commit 27803ae into duckdb:main Nov 26, 2024
16 checks passed
@teaguesterling
Copy link
Contributor Author

Looks great, thanks a lot @teaguesterling

Thanks for all your support on this! I had been looking for a way to familiarize myself with the DuckDB extension system.

teaguesterling added a commit to teaguesterling/duckdb-web that referenced this pull request Nov 28, 2024
Update the duckdb_iceberg documentation to better describe some of the optional version resolution functionality and document the version guessing behavior provided by duckdb/duckdb-iceberg#79. Also add a table of common parameters at the top of the document for easier reference.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants