Add plan tasks for TableScan #1427
I assume it's intentional that we're not actually calling this in any of the methods (like `to_arrow`, etc.) that actually execute a scan? If the plan is to call this in `to_arrow` eventually, it would be good to have some benchmarks with realistic latencies (e.g., against actual object storage). If there is no plan to call this directly in pyiceberg, I wonder who the intended consumers of this API would be? I would expect most query engines -- distributed or otherwise -- to have their own notions of how to optimize scan planning.
Also, assuming there's still some future plan to offload expensive work like data scanning to a Rust core, I do wonder if we want to commit to exposing additional scan-planning APIs that may not align with whatever query engine gets embedded in pyiceberg.
If this is primarily intended for the benefit of pyiceberg's own scan execution, I would consider making it a private API.
Good catch, I originally thought the `plan_task` function was being used. Looks like I confused `plan_files` with `plan_task`. Currently `plan_files` is used by the read path: https://grep.app/search?q=plan_files&filter[repo][0]=apache/iceberg-python&filter[path][0]=pyiceberg/table/
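For context, this is roughly where `plan_files` sits in the read path today: `to_arrow` and friends consume the file-level tasks it produces. A minimal sketch (catalog and table names below are placeholders):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")                # placeholder catalog name
table = catalog.load_table("examples.events")    # placeholder table identifier

scan = table.scan(
    row_filter="event_ts >= '2024-01-01'",       # placeholder filter
    selected_fields=("id", "event_ts"),
)

# plan_files() yields one FileScanTask per matching data file; this is the
# granularity the read path works at today.
for task in scan.plan_files():
    print(task.file.file_path, task.file.file_size_in_bytes, task.file.record_count)

# to_arrow() executes the scan by reading those files into a pyarrow Table.
arrow_table = scan.to_arrow()
```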
Yes, I am not sure of the original intention behind reading the full file, so I kept it unchanged.
I recently hit OOM errors when reading an Iceberg table with Ray in a distributed setting. Also, the parallelism is limited by the number of files when planning at file granularity (the files had been rewritten into larger ones, e.g., 512 MB or 1 GB). Like Spark/Flink, I think `plan_tasks` should be useful for distributed reading. Distributed engines such as Ray/Daft would otherwise each have to implement `plan_tasks` themselves if they want it. Wouldn't it be better to put it into pyiceberg?
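To illustrate the constraint described here: with file-level planning, a distributed reader gets at most one task per data file, and each worker has to materialize a whole file. A rough sketch of that pattern with Ray (hypothetical wiring, not an actual Ray Datasets integration; deletes, projection, and object-store filesystems are ignored for brevity):

```python
import ray
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

ray.init()

@ray.remote
def read_whole_file(path: str):
    # One task per data file: the worker materializes the entire file, which is
    # where both the memory pressure and the parallelism ceiling come from.
    return pq.read_table(path)

table = load_catalog("default").load_table("examples.events")  # placeholders
file_tasks = list(table.scan().plan_files())
futures = [read_whole_file.remote(t.file.file_path) for t in file_tasks]
results = ray.get(futures)  # at most len(file_tasks) reads can run in parallel
```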
I have the same concern as @corleyma. For example, Java splits on row groups to make full use of the parallelism of Spark, Trino, Hive, etc. On a local machine, it makes more sense to just plow through the file itself, preferably using a native reader like PyArrow or, in the future, Iceberg-Rust.
There are a lot of details around this API. For example, a task might point to a row group that doesn't contain any relevant information, and we won't know that until we open the file itself.
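For concreteness, executing a row-group-level task locally looks roughly like this with PyArrow; the per-row-group statistics that would tell us whether a group is even relevant only become available once the footer has been read (the path and column names are placeholders):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("part-00000.parquet")  # placeholder path

# The footer is what tells us how many row groups exist and what their
# statistics look like -- information a file-level plan does not carry.
meta = pf.metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(i, rg.num_rows, rg.total_byte_size)

# A task scoped to a single row group would read just that group:
batch = pf.read_row_group(0, columns=["id", "event_ts"])
```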
Yes, we perform our own plan-files splitting and merging. We create an iterator from pyiceberg where @kevinjqliu linked the code, but during our own optimization passes we perform fusion of files as well as splitting by row group.
The logic for that can be found here:
https://github.com/Eventual-Inc/Daft/blob/e148248dae8af90c8993d2ec6b2f471521c0a7f2/src/daft-scan/src/scan_task_iters.rs#L181
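For readers who don't want to dig through the Rust, the shape of the idea is: split files larger than a target size on row-group boundaries (using `split_offsets` when available) and fuse adjacent small files until the target is reached. A simplified Python sketch of that idea -- not Daft's actual implementation, and delete files are ignored:

```python
from dataclasses import dataclass

TARGET_BYTES = 512 * 1024 * 1024  # example target scan-task size

@dataclass
class ByteRange:
    path: str
    offset: int   # byte offset to start reading from (row-group aligned)
    length: int   # number of bytes covered by this range

def split_and_merge(file_tasks, target=TARGET_BYTES):
    """Rebalance file-level tasks into roughly target-sized scan tasks.

    Each returned scan task is a list of ByteRange objects: either one
    row-group-aligned slice of a large file, or several whole small files
    fused together.
    """
    scan_tasks, pending, pending_bytes = [], [], 0
    for t in file_tasks:
        size = t.file.file_size_in_bytes
        offsets = list(t.file.split_offsets or [])
        if size > target and len(offsets) > 1:
            # Large file: one scan task per row-group byte range.
            bounds = offsets + [size]
            for start, end in zip(bounds[:-1], bounds[1:]):
                scan_tasks.append([ByteRange(t.file.file_path, start, end - start)])
        else:
            # Small file: fuse with its neighbours until we hit the target.
            pending.append(ByteRange(t.file.file_path, 0, size))
            pending_bytes += size
            if pending_bytes >= target:
                scan_tasks.append(pending)
                pending, pending_bytes = [], 0
    if pending:
        scan_tasks.append(pending)
    return scan_tasks
```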
Thanks @samster25. One question: do you plan purely on the data provided by PyIceberg, or do you first fetch the footers of the Parquet files?
@Fokko If we have metadata from pyiceberg such as num_rows, file_size, and num_row_groups (derived from split_offsets), we can skip fetching the Parquet footer and speed up planning.
On that note, we would also love to have `distinct_counts` in the DataFile as well. This would help us estimate memory usage when planning a query in Daft.
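As an illustration of planning purely off the DataFile metadata: `split_offsets` already encodes the row-group boundaries, so the row-group count and rough per-group sizes fall out without any I/O. A sketch under that assumption (the exact estimation an engine does is its own):

```python
def row_group_layout(data_file):
    """Estimate the row-group layout of a data file from metadata alone."""
    size = data_file.file_size_in_bytes
    offsets = list(data_file.split_offsets or [])
    if not offsets:
        # No split offsets recorded: we would have to fetch the footer after all.
        return None
    bounds = offsets + [size]
    # The last range also covers the footer, so it slightly overestimates.
    group_sizes = [end - start for start, end in zip(bounds[:-1], bounds[1:])]
    return {
        "num_row_groups": len(group_sizes),
        "group_sizes_bytes": group_sizes,
        "approx_rows_per_group": data_file.record_count / len(group_sizes),
    }
```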
Re what @ConeyLiu was mentioning about the issues he encountered with Ray: I think it would be reasonable for pyiceberg to keep this more nuanced distributed planning logic in a private API and (if we wanted to) expose a Ray Datasets implementation that uses the private API. But I'd vote against committing to this planning logic in a public API for the reasons outlined above: engines want to bring their own logic, and pyiceberg may well want to significantly change this logic for itself in the future, to the extent that we do continue to embed more query-engine functionality for convenience.
Thanks for the feedback.
@kevinjqliu I added more tests here, please take another look and give any further advice.