HIVE Partitions in AWS Athena #555
Comments
@sh-rp let's take a look at the native partitioning that pyarrow has. OFC in that case we'll need to deal with several files in folders, so our jobs mechanism gets more complicated (but why not have jobs that are folders ;> OMG). In the Weaviate PR we are adding the concept of a destination adapter that injects additional information into the schema to be interpreted by the loader/normalizer (the data writer also has access to it). So it all fits together, but it still looks like a bigger project.
Documenting a potential use-case. I'm loading files from S3 + SFTP + API sources to an S3 destination. I'd like to use properties from the source (not the data itself) to design the partition layout in the destination. The properties may be resource arguments, or indirect source metadata like the object path and/or file name. Examples:
How others have (not) done it:
Implications for this ticket:
TL;DR I'm looking to partition based on metadata, not the data itself.
Just wanted to add a thing to think about: Iceberg tables make use of hidden partitioning, and they keep track of different partitioning specs over time. https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg#hidden-partitions Considering this issue is about AWS Athena partitions, I do not believe this is out of scope, and it might be the "simpler" part of the issue. Best
(Partly) subsumed by #930
In case native iceberg tables are used in Athena, the partition implementation can be delegated to iceberg directly (hidden partitioning), and it's possible to use an implementation similar to what is done in BigQuery, where the PARTITION BY clause is added via SQL, pretty much as @MRocholl mentioned. For pure parquet writing this of course doesn't work, because partitions are part of the object path of the data written to S3. @sh-rp @rudolfix @sultaniman as I'm interested in using the Athena/Iceberg destination with partitions, do you see anything against what I proposed above? It's just about adding the right SQL for iceberg managed tables.
@nicor88 I've been working on extending layout placeholders for the filesystem destination; atm we don't yet support syncing or taking date & time from parquet files for partitioning.
@sultaniman Layout placeholders for the filesystem destination, from what I can see, won't work with Iceberg Athena managed tables. For example, an iceberg table can be created like this:
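A minimal sketch (the database, column, and bucket names are just examples):

```sql
-- Athena managed Iceberg table: the partition spec is declared once at
-- creation time; Iceberg's hidden partitioning takes care of the rest.
CREATE TABLE my_db.events (
    event_id       string,
    payload        string,
    ingestion_date timestamp
)
PARTITIONED BY (day(ingestion_date))  -- Iceberg partition transform
LOCATION 's3://my-bucket/my_db/events/'
TBLPROPERTIES ('table_type' = 'ICEBERG');
```

With `table_type = 'ICEBERG'`, Athena treats the table as Iceberg managed; the `day(...)` transform is what gives the hidden partitioning mentioned above.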
The writing to the file system is handled under the hood by Athena itself in the case of an Iceberg table, not by the writer (dlt), because of Iceberg hidden partitioning. Therefore the only thing to do with Athena iceberg tables is to specify the partition definition on table creation, or as a post-hook (e.g. using pyiceberg) to modify the partition spec of a table. That said, I was planning to propose a PR for that, because it's the only limitation that is blocking us from using dlt in production. All our tables are partitioned by ingestion date, for example, to reduce data scans downstream. Happy to have a chat with more details if the above is not clear.
@nicor88 sure, we can have a chat about it. I think once we merge the original PR I will be more than happy to allow iceberg support 🙂
We want to enable the user to make use of AWS Athena partitions. For this we will most likely need to change the schema a bit and add the option for the user to define partitions on the resources.
Analyzing the discussion on this topic, we can implement this in 3 steps of increasing complexity (3 PRs expected):
Support iceberg tables
Here partitioning is available out of the box on the create table command (see the CREATE TABLE sketch earlier in this thread); partitions are created based on the data in the tables.
Support regular, hive tables
After loading a file, ADD PARTITION must be executed to map the file into a partition (a sketch of the statement follows below). Partitions are created based on metadata that must be available via the file layout placeholders. So if a given partition column is declared in the adapter, the same column name must be a placeholder in the file layout (or be present in the placeholders list in `path_utils.py`) used to render file names. See how `filesystem` does that. We may need to give athena access to rendering file names via `staging_config`.
Note: I'm not sure whether the table must physically have a partition column, or whether we may just declare it and fill the data from the layout placeholder. That needs to be tested.
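A sketch of that statement, assuming a table partitioned by an `ingestion_date` column whose value is rendered from a layout placeholder (all names are illustrative):

```sql
-- Register the folder a file was loaded into as a hive partition.
ALTER TABLE my_db.events ADD IF NOT EXISTS
PARTITION (ingestion_date = '2024-05-01')
LOCATION 's3://my-bucket/my_db/events/ingestion_date=2024-05-01/';
```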
Hive partitioning based on table data
When #1382 is merged we can easily add more table formats (i.e. hive) and use them for partitioning. From our discussion we know that `MSCK REPAIR TABLE` costs too much for incremental loads, so I do not yet see value in it.
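For reference, a sketch of that statement (table name illustrative); it is expensive because it scans the whole table location on S3 to discover partition folders, regardless of how small the increment was:

```sql
-- Scans the table's S3 path for partition folders missing from the
-- catalog and registers them; cost grows with the number of objects.
MSCK REPAIR TABLE my_db.events;
```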
This is the old discussion:
~~Tasks:~~
To think about: