PyIceberg Production Use case survey #1202

Open
kevinjqliu opened this issue Sep 24, 2024 · 6 comments
@kevinjqliu
Contributor

Feature Request / Improvement

As part of the journey toward version 1.0, we want to capture how this library is used in "production" environments.

Would love to hear from current users (and potential users) on different use cases. This will better inform the future roadmap.

Please include use cases in this issue, or if necessary I can start a Google Survey.

kevinjqliu pinned this issue Sep 24, 2024
@mariotaddeucci

Hey, I'm actually using it in production for small datasets in combination with DuckDB, especially to avoid small files when web scraping.

For ingestion, I read many raw files (JSON, CSV, and Parquet), each of them keyed with a ULID (a sortable ID is necessary), combined with an overwrite that uses this key as the overwrite filter.
DuckDB produces a record_batch_reader, which lets me derive the table and schema without loading everything into memory; after creating the table, it has to be converted into an Arrow table to write the final Iceberg table.

Because the ID is sortable, it's possible to use a filter predicate that overwrites the data between the upper and lower bounds of the dataset being ingested.

Table maintenance still uses Spark for expiring snapshots.

To avoid small files, after a certain period I reload the entire dataset using DuckDB's native Iceberg reader and overwrite it fully (a workaround for a rewrite-files procedure).
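For context, a minimal sketch of this ingestion pattern might look something like the following. The catalog name, table identifier, file glob, and the "id" ULID column are assumptions for illustration, not details from this comment; it also assumes the catalog is configured for `load_catalog`.

```python
import duckdb
import pyarrow.compute as pc
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, GreaterThanOrEqual, LessThanOrEqual

# Hypothetical catalog/table names for the sketch.
catalog = load_catalog("default")
table = catalog.load_table("raw.scraped_events")

con = duckdb.connect()
# DuckDB reads the raw files and exposes them as a streaming record batch reader.
reader = con.execute("SELECT * FROM read_json_auto('landing/*.json')").fetch_record_batch()

# PyIceberg currently writes from a fully materialized Arrow table,
# so the stream has to be collected into memory first.
arrow_table = reader.read_all()

# The ULID key is sortable, so the batch's min/max bounds define exactly
# which existing rows should be replaced.
bounds = pc.min_max(arrow_table.column("id"))
lower, upper = bounds["min"].as_py(), bounds["max"].as_py()

table.overwrite(
    arrow_table,
    overwrite_filter=And(
        GreaterThanOrEqual("id", lower),
        LessThanOrEqual("id", upper),
    ),
)
```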

I would love to expand this to more scenarios, but some features are needed first:

  • Allow writing from a record_batch_reader, so there's no need to load a full Arrow table in memory.
  • Expire snapshots from PyIceberg, which would make maintenance easier, with no external engine or tool.
  • Maybe a simple optimization like bin-packing; it's not the best, but it's better than reading everything and overwriting it.
  • Maybe an integration with DuckDB, just taking the latest metadata location and creating a view on it using their native Iceberg reader.
  • A true MERGE operation, to avoid errors when doing upserts and make it unnecessary to use the upper and lower bounds of the DataFrame key as the overwrite filter.

These pipelines are moving off the Spark server and running in isolated containers.

@andreapiso

Using PyIceberg alongside Trino. Our ETL is in Trino; PyIceberg is great for assets where we're doing things like grabbing data from APIs. Instead of storing files and crawling them into Iceberg tables with something like Glue, we can write that data into Iceberg directly so that our Trino pipelines can process it right away. Super convenient!
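A rough sketch of that pattern, writing API responses straight to an Iceberg table so Trino can query them (the endpoint, catalog name, and table identifier below are placeholders):

```python
import requests
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder endpoint returning a JSON array of records.
records = requests.get("https://api.example.com/v1/orders").json()
arrow_table = pa.Table.from_pylist(records)

catalog = load_catalog("default")
table = catalog.load_table("staging.api_orders")

# Append the new batch; Trino sees the new snapshot on its next query.
table.append(arrow_table)
```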

@djouallah

I use it mainly for testing XTable conversion from Iceberg to Delta; it is by far the easiest way to generate Iceberg tables :)
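As a rough illustration, generating a small Iceberg table for conversion tests can be as simple as the following (catalog and table names are made up, and the namespace is assumed to exist):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

# A tiny sample dataset to materialize as an Iceberg table.
data = pa.table({
    "id": pa.array([1, 2, 3], pa.int64()),
    "name": pa.array(["a", "b", "c"]),
})

# Create the table from the Arrow schema and write the sample rows.
table = catalog.create_table("testing.xtable_source", schema=data.schema)
table.append(data)
```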

@emorfam

emorfam commented Nov 13, 2024

Currently using PyIceberg for monitoring metadata statistics of Iceberg tables in a custom application (e.g. file count, record count, data distribution across partitions). We periodically compute these statistics, write them to Postgres, and hook that up to Grafana. This gives us a better idea of how to optimize our Iceberg tables further (e.g. partition layout).
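A hedged sketch of how such statistics can be collected from PyIceberg's scan planning (the table identifier is a placeholder, and the Postgres/Grafana side is omitted):

```python
from collections import Counter
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("warehouse.events")  # placeholder table name

file_count = 0
record_count = 0
records_per_partition = Counter()

# plan_files() lists the data files of the current snapshot together with
# their per-file metadata (record counts, partition values, sizes, ...).
for task in table.scan().plan_files():
    data_file = task.file
    file_count += 1
    record_count += data_file.record_count
    records_per_partition[str(data_file.partition)] += data_file.record_count

print(f"files={file_count} records={record_count}")
for partition, count in records_per_partition.most_common():
    print(partition, count)
```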

In the long run we would like to use PyIceberg as a low-cost alternative to Glue streaming (possibly with AWS Lambda or Quix-Streams inside of Fargate). This is especially interesting for applications that are low-volume in data but have harder requirements on timeliness of data compared to batch jobs. Here are some example use cases:

  • Processing assembly-trees in manufacturing that change over time.
  • Ingesting sensor data from production plants that can contain duplicate messages.

MERGE support would be really helpful here. I guess handling the amount of data that is loaded from the target table during the MERGE operation (e.g. with push-down predicates) will be the biggest obstacle.

Thanks for the great work that the Iceberg community is doing.

@randypitcherii

I use it to mirror tables from one catalog to another all the time. I have scheduled production jobs that do this mirroring before and after my dbt builds.
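For anyone curious, a simplified sketch of this kind of catalog-to-catalog mirroring (catalog names and the table identifier are assumptions; this copies the current snapshot's data rather than the table's full history):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import NoSuchTableError

source = load_catalog("source_catalog")
target = load_catalog("target_catalog")

src_table = source.load_table("analytics.daily_metrics")
data = src_table.scan().to_arrow()

# Create the table on the target side if it doesn't exist yet, then
# replace its contents with the latest data from the source.
try:
    dst_table = target.load_table("analytics.daily_metrics")
except NoSuchTableError:
    dst_table = target.create_table("analytics.daily_metrics", schema=data.schema)

dst_table.overwrite(data)
```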

Pyiceberg is just the best library.

@manuzhang
Contributor

Maybe we can enable Discussions like https://github.com/apache/iceberg-rust/discussions for this purpose
