
PyIceberg Cookbook #1201

Open
kevinjqliu opened this issue Sep 24, 2024 · 6 comments

Comments

@kevinjqliu (Contributor)

Feature Request / Improvement

It was brought up at the recent community sync that we should start a cookbook to capture different use cases with PyIceberg, similar to the Tabular Iceberg cookbook.

Starting this issue to track the creation of the cookbook and, more importantly, the items people would like to see included in it.

Feel free to add suggestions below.

@kevinjqliu (Contributor, Author)

Copying over from the community sync:

Cookbook suggestions

  • Support for incremental processing with "change table" (link)
  • Create a table like another table
  • Get data file references between two given snapshot ids or timestamps

kevinjqliu pinned this issue Sep 24, 2024
@shiv-io

shiv-io commented Oct 19, 2024

@kevinjqliu are you accepting contributions for this cookbook yet? Happy to help if so!

@kevinjqliu (Contributor, Author)

Hi @shiv-io, yes, we're accepting contributions. We just don't have a page set up for the cookbook yet.

@francocalvo

Hey! I'm creating a PoC using PyIceberg for a project, and I'm quite interested in incremental processing.

For this, I've previously used MERGE operations to update the table with data from a DataFrame (I was using Delta with Spark at the time).

Is this possible yet? Something similar would be overwrite + overwrite_filter, but I can't really use that with a DataFrame; I'd have to pass the filter as a string, right? And in that case, an IN clause with thousands of IDs would degrade performance.

@kevinjqliu (Contributor, Author)

kevinjqliu commented Nov 6, 2024

hey @francocalvo,
the MERGE operation is not yet supported (#402).
For writes, PyIceberg currently supports append and overwrite. I think overwrite + overwrite_filter gets you close to the MERGE use case.
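A minimal sketch of that overwrite + overwrite_filter pattern, using plain Python dicts as stand-in data (the table, column names, and batch below are hypothetical; in PyIceberg itself this would map to something like `Table.overwrite(batch, overwrite_filter=In("id", incoming_ids))`, if I read the current API right):

```python
def upsert(existing: list[dict], incoming: list[dict], key: str = "id") -> list[dict]:
    """MERGE-style upsert semantics: overwrite rows matched by key, append the rest."""
    incoming_keys = {row[key] for row in incoming}
    # The "overwrite_filter" step: drop existing rows whose key is in the new batch.
    kept = [row for row in existing if row[key] not in incoming_keys]
    # Then write the new data.
    return kept + incoming

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
batch = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(upsert(table, batch))
# → [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'B'}, {'id': 3, 'v': 'c'}]
```

Note this only covers the "matched → replace, not matched → insert" case; a full MERGE with conditional update logic still needs #402.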

but I can't really use that with a DataFrame, I'd have to pass it as a string, right?

Writes work with PyArrow tables and dataframes; I don't think you need to pass them as strings.

And in that case, a IN clause with thousands of IDs would deteriorate performance

It depends on the exact logic, but we do apply some optimizations, such as filter pushdown, to speed up reads and writes.
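To illustrate why a large IN list need not mean scanning everything: Iceberg keeps min/max column statistics per data file in its manifests, so whole files can be pruned before any row is read. A stdlib-only sketch of that idea (file names and stats below are made up):

```python
def prune_files(files: dict[str, tuple[int, int]], ids: set[int]) -> list[str]:
    """Keep only files whose [min, max] id range could contain a wanted id."""
    return [
        name
        for name, (lo, hi) in files.items()
        if any(lo <= i <= hi for i in ids)
    ]

files = {"f1.parquet": (1, 100), "f2.parquet": (101, 200), "f3.parquet": (201, 300)}
wanted = {5, 250}
print(prune_files(files, wanted))  # f2.parquet is skipped entirely
```

The actual pruning PyIceberg does is more involved (partition-level and file-level), but the effect is the same: the cost of the filter scales with the files that can match, not with the whole table.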

@francocalvo

Thank you for the prompt answer!

Writes work with PyArrow tables and dataframes; I don't think you need to pass them as strings.

Yes, what I mean is the case where I need to update an Iceberg table using an Arrow table. Previously I used a MERGE with a WHEN MATCHED UPDATE clause, which allowed me to 'soft-delete' old versions (it's an SCD Type 2 table). In some cases I need to update 10k+ rows in one go, matching them on an ID.
Reading the code, I see that I can write with Arrow tables, but not build filters from them.
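For reference, here's a stdlib-only sketch of the SCD Type 2 "WHEN MATCHED" behavior I mean: matched rows are soft-deleted by closing their validity window, and the new versions are appended as current rows (column names and schema are hypothetical):

```python
from datetime import date

def scd2_merge(existing: list[dict], updates: list[dict], today: date) -> list[dict]:
    """Soft-delete matched current rows, then append updates as the new current rows."""
    update_ids = {u["id"] for u in updates}
    out = []
    for row in existing:
        if row["is_current"] and row["id"] in update_ids:
            # WHEN MATCHED: close the old version instead of deleting it.
            out.append({**row, "is_current": False, "valid_to": today})
        else:
            out.append(row)
    for u in updates:
        # WHEN NOT MATCHED (and the new version of matched keys): insert as current.
        out.append({**u, "is_current": True, "valid_from": today, "valid_to": None})
    return out

table = [{"id": 1, "v": "a", "is_current": True,
          "valid_from": date(2023, 1, 1), "valid_to": None}]
print(scd2_merge(table, [{"id": 1, "v": "A"}], date(2024, 6, 1)))
```

With overwrite_filter this could perhaps be approximated by rewriting only the affected current rows, but a native MERGE (#402) would express it much more directly.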

In any case, I'm glad this exists and hope the cookbook creates a good starting point for people that are trying this out.
