
PyIceberg Cookbook #1201

Open
kevinjqliu opened this issue Sep 24, 2024 · 6 comments

Comments

@kevinjqliu (Contributor)

Feature Request / Improvement

It was brought up at the recent community sync that we should start a cookbook to capture different use cases with PyIceberg, similar to the Tabular Iceberg cookbook.

Starting this issue to track the creation of the cookbook and, more importantly, the items people would like to see included in it.

Feel free to add suggestions below.

@kevinjqliu (Contributor, Author)

Copying over from the community sync:

Cookbook suggestions

  • Support for incremental processing with "change table" (link)
  • Create a table like another table
  • Get data file references between two given snapshot ids or timestamps

kevinjqliu pinned this issue Sep 24, 2024
@shiv-io

shiv-io commented Oct 19, 2024

@kevinjqliu are you accepting contributions for this cookbook yet? Happy to help if so!

@kevinjqliu (Contributor, Author)

Hi @shiv-io, yes, we're accepting contributions. We just don't have a page set up for the cookbook yet.

@francocalvo

Hey! I'm creating a PoC using PyIceberg for a project, and I'm quite interested in incremental processing.

For this, I've previously used MERGE operations to update the table with data from a DataFrame (I was using Delta with Spark at the time).

Is this possible yet? Something similar would be overwrite + overwrite_filter, but I can't really use that with a DataFrame; I'd have to pass the filter as a string, right? And in that case, an IN clause with thousands of IDs would degrade performance.

@kevinjqliu (Contributor, Author)

kevinjqliu commented Nov 6, 2024

hey @francocalvo,
the MERGE operation is not yet supported (#402).
For writes, PyIceberg currently supports append and overwrite. I think overwrite + overwrite_filter gets you close to the MERGE use case.
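A minimal sketch of that overwrite + overwrite_filter pattern, using plain Python dicts as stand-in data (the table, column names, and batch below are hypothetical; in PyIceberg itself this would map to something like `Table.overwrite(batch, overwrite_filter=In("id", incoming_ids))`, if I read the current API right):

```python
def upsert(existing: list[dict], incoming: list[dict], key: str = "id") -> list[dict]:
    """MERGE-style upsert semantics: overwrite rows matched by key, append the rest."""
    incoming_keys = {row[key] for row in incoming}
    # The "overwrite_filter" step: drop existing rows whose key is in the new batch.
    kept = [row for row in existing if row[key] not in incoming_keys]
    # Then write the new data.
    return kept + incoming

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
batch = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(upsert(table, batch))
# → [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'B'}, {'id': 3, 'v': 'c'}]
```

Note this only covers the "matched → replace, not matched → insert" case; a full MERGE with conditional update logic still needs #402.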

but I can't really use that with a DataFrame, I'd have to pass it as a string, right?

Writes work with PyArrow tables and dataframes; I don't think you need to pass them as strings.

And in that case, a IN clause with thousands of IDs would deteriorate performance

It depends on the exact logic, but we do apply some optimizations, such as filter pushdown, to speed up reads and writes.
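To illustrate why a large IN list need not mean scanning everything: Iceberg keeps min/max column statistics per data file in its manifests, so whole files can be pruned before any row is read. A stdlib-only sketch of that idea (file names and stats below are made up):

```python
def prune_files(files: dict[str, tuple[int, int]], ids: set[int]) -> list[str]:
    """Keep only files whose [min, max] id range could contain a wanted id."""
    return [
        name
        for name, (lo, hi) in files.items()
        if any(lo <= i <= hi for i in ids)
    ]

files = {"f1.parquet": (1, 100), "f2.parquet": (101, 200), "f3.parquet": (201, 300)}
wanted = {5, 250}
print(prune_files(files, wanted))  # f2.parquet is skipped entirely
```

The actual pruning PyIceberg does is more involved (partition-level and file-level), but the effect is the same: the cost of the filter scales with the files that can match, not with the whole table.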

@francocalvo

Thank you for the prompt answer!

Writes work with PyArrow tables and dataframes; I don't think you need to pass them as strings.

Yes, what I mean is the case where I need to update an Iceberg table using an Arrow table. Previously I used a MERGE with a WHEN MATCHED UPDATE clause, which allowed me to 'soft-delete' old versions (it's an SCD Type 2 table). In some cases I need to update 10k+ rows in one go, matching them on an ID.
Reading the code, I see that I can write with Arrow tables, but not build filters from them.
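For reference, here's a stdlib-only sketch of the SCD Type 2 "WHEN MATCHED" behavior I mean: matched rows are soft-deleted by closing their validity window, and the new versions are appended as current rows (column names and schema are hypothetical):

```python
from datetime import date

def scd2_merge(existing: list[dict], updates: list[dict], today: date) -> list[dict]:
    """Soft-delete matched current rows, then append updates as the new current rows."""
    update_ids = {u["id"] for u in updates}
    out = []
    for row in existing:
        if row["is_current"] and row["id"] in update_ids:
            # WHEN MATCHED: close the old version instead of deleting it.
            out.append({**row, "is_current": False, "valid_to": today})
        else:
            out.append(row)
    for u in updates:
        # WHEN NOT MATCHED (and the new version of matched keys): insert as current.
        out.append({**u, "is_current": True, "valid_from": today, "valid_to": None})
    return out

table = [{"id": 1, "v": "a", "is_current": True,
          "valid_from": date(2023, 1, 1), "valid_to": None}]
print(scd2_merge(table, [{"id": 1, "v": "A"}], date(2024, 6, 1)))
```

With overwrite_filter this could perhaps be approximated by rewriting only the affected current rows, but a native MERGE (#402) would express it much more directly.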

In any case, I'm glad this exists and hope the cookbook creates a good starting point for people that are trying this out.
