[DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc) #13456

alamb · 2024-11-17T18:55:30Z

Is your feature request related to a problem or challenge?

I personally think making it easy to use DataFusion with the "open data lake" stack is very important over the next few months.

@julienledem wrote up a very nice piece describing The advent of the Open Data Lake

The high level idea is to make it really easy for people to build systems that query (quickly!) from parquet files stored on remote object store, including Apache Iceberg, Delta Lake, Hudi, etc.

You can already use DataFusion (and datafusion-cli) to query such data, but it takes non trivial effort to configure and tune for good performance. My idea is to make it easier to do so / make DataFusion better out of the box.

With that as a building block, people could/would build applications and systems targeting specific usecases

I don't yet fully understand where we currently stand on this goal, but I wanted to start hte discussio

Describe the solution you'd like

In my mind, the specific work this entails stuff like

Making it easier to use iceberg/delta/hudi with DataFusion
Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393
Make parquet reader in arrow-rs faster/better on remote object stores
Making it eaiser to cache parquet metadata

Describe alternatives you've considered

One specific item, brought up by @MrPowers would be to try DataFusion with the "10B row challenge" described in https://dataengineeringcentral.substack.com/p/10-billion-row-challenge-duckdb-vs .

I suspect it would be non ideal at first, but trying it to figure out what the challenges are would help us focus our efforts

Additional context

No response

The text was updated successfully, but these errors were encountered:

comphead · 2024-11-18T17:29:14Z

What about remote HDFS files support? We have a contribution project https://github.com/datafusion-contrib/datafusion-objectstore-hdfs which supposed to query HDFS, but not sure how far we are with that

alamb · 2024-11-19T12:03:14Z

Yes I think HDFS would be another good target

Basically I want to make sure that it is as easy as possible to use DataFusion to query data that lives on remote systems (aka where the data is not on some local NVME but must be accessed over the network)

jonathanc-n · 2024-11-21T03:50:20Z

I forgot to mention this here, apache/iceberg-rust#700 (write support) is a really nice issue opened up in iceberg-rust. Getting the rust implementation of iceberg up and going would probably help out datafusion a bit on the data lake side of things.

alamb · 2024-11-21T21:46:41Z

I forgot to mention this here, apache/iceberg-rust#700 (write support) is a really nice issue opened up in iceberg-rust. Getting the rust implementation of iceberg up and going would probably help out datafusion a bit on the data lake side of things.

Yes, 100% -- one of my goals is to make it easy for this to "just work" with DataFusion. I think we are a bit away from it at the moment

matthewmturner · 2024-11-22T13:55:28Z

@alamb from a datafusion perspective which parts do you think are missing? I ask about just the datafusion perspective because i am assuming the owners of the relevant table formats will be implementing their spec / protocol.

matthewmturner · 2024-11-22T17:38:28Z

The way i was thinking about it the thing that makes it interesting / difficult is getting the semantics of each of the formats as part of datafusion which i would presume needs to be done as either SQL extensions and/or custom execution plans. Without knowing much about the specifics i thought that was already doable - but maybe im missing something.

alamb · 2024-11-22T22:45:53Z

@alamb from a datafusion perspective which parts do you think are missing? I ask about just the datafusion perspective because i am assuming the owners of the relevant table formats will be implementing their spec / protocol.

I am thinking mostly the list

Document DataFusion Threading / tokio runtimes (how to separate IO and CPU bound work) #12393
Make an example showing how to use delta.rs / iceberg via FFI (and avoid version dependency hell)
Make an example showing how to cache parquet metadata etc (not sure about this one yet -- I need to get it to the point where it is easy / convenient to read remote files first)

matthewmturner · 2024-11-23T01:01:01Z

Thanks @alamb

I plan to work on the second item - probably in December. I was really hoping to get a dft release out shortly where all the custom table providers (Delta / Iceberg / Hudi - and potentially Lance) were on DF v43 - but that might be wishful thinking.

Depending how the next week goes i will make a judgement call on whether or not to wait. If it isnt looking promising that everyone will be on v43 then my next priority will be implementing the FFI.

blaginin · 2024-12-03T21:44:37Z

I think one problem with the current implementation of external storages is that it's pretty hard to test properly. For example, the issue I solve in #13576 happened because right now, we only test that the external storage parameters are parsed, but we don’t even check if they’re parsed correctly.

Maybe we should start mocking aws/iceberg/... and move more towards integration testing? That way, we’d be more confident that our external storage support actually works 😅

comphead · 2024-12-03T21:49:11Z

@alamb I will try to cover how DataFusion works with remote HDFS files if that fits.
I'm planning to experiment with datafusion-objectstore-hdfs and make dev/tests/examples/docs so the user can easily query HDFS files from Datafusion CLI

alamb · 2024-12-04T01:07:53Z

Maybe we should start mocking aws/iceberg/... and move more towards integration testing? That way, we’d be more confident that our external storage support actually works 😅

Yes 100% -- maybe we can take a friendly look at the emultors used in object_store tests

https://github.com/apache/arrow-rs/blob/9047d99f6bf87582532ee6ed0acb3f2d5f889f11/.github/workflows/object_store.yml#L91-L184

timvw · 2025-01-09T08:21:30Z

In qv I use minio to test whether s3 integration works...

https://github.com/timvw/qv/blob/main/tests/files_on_s3.rs#L9
https://github.com/timvw/qv/blob/main/ci/minio_start.sh

Integration with delta-rs was always pretty easy, iceberg-rust|rs has been more difficult (mostly because you also need to access a catalog to get the correct/current relevant metadata to get started...)

alamb added the enhancement New feature or request label Nov 17, 2024

alamb mentioned this issue Nov 17, 2024

[DISCUSSION] 2024 Q4 / 2025 Q1 Roadmap #13274

Open

blaginin mentioned this issue Nov 26, 2024

Add configurable normalization for configuration options and preserve case for S3 paths #13576

Merged

blaginin mentioned this issue Dec 5, 2024

Add snapshot testing to CLI & set up AWS mock #13672

Open

alamb changed the title ~~[DISCUSSION] Make it easy and fast to query files on remote files (S3, iceberg, etc)~~ [DISCUSSION] Make it easier and faster to query files on remote files (S3, iceberg, etc) Dec 21, 2024

alamb changed the title ~~[DISCUSSION] Make it easier and faster to query files on remote files (S3, iceberg, etc)~~ [DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc) Dec 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc) #13456

[DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc) #13456

alamb commented Nov 17, 2024

comphead commented Nov 18, 2024

alamb commented Nov 19, 2024 •

edited

Loading

jonathanc-n commented Nov 21, 2024 •

edited

Loading

alamb commented Nov 21, 2024

matthewmturner commented Nov 22, 2024

matthewmturner commented Nov 22, 2024

alamb commented Nov 22, 2024

matthewmturner commented Nov 23, 2024

blaginin commented Dec 3, 2024 •

edited

Loading

comphead commented Dec 3, 2024

alamb commented Dec 4, 2024

timvw commented Jan 9, 2025

[DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc) #13456

[DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc) #13456

Comments

alamb commented Nov 17, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

comphead commented Nov 18, 2024

alamb commented Nov 19, 2024 • edited Loading

jonathanc-n commented Nov 21, 2024 • edited Loading

alamb commented Nov 21, 2024

matthewmturner commented Nov 22, 2024

matthewmturner commented Nov 22, 2024

alamb commented Nov 22, 2024

matthewmturner commented Nov 23, 2024

blaginin commented Dec 3, 2024 • edited Loading

comphead commented Dec 3, 2024

alamb commented Dec 4, 2024

timvw commented Jan 9, 2025

alamb commented Nov 19, 2024 •

edited

Loading

jonathanc-n commented Nov 21, 2024 •

edited

Loading

blaginin commented Dec 3, 2024 •

edited

Loading