
Allow easier partitioning and compaction in Delta tables written to filesystem #2062

Open
arjun-panchmatia-mechademy opened this issue Nov 13, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

@arjun-panchmatia-mechademy

Feature description

It is currently possible to set a partitioning strategy when writing to a Delta Lake table on cloud or local storage, by passing the relevant parameters to the resource decorator or applying hints. However, the filesystem destination does not appear to natively support more complex partitioning strategies. For example, given a resource that emits timestamps, it is not possible to partition granularly by year, month, and day.

Considering how common the above use case is, it would be very useful to support it natively. The current workaround is to create the year, month, and day columns in the resource itself and then partition on those (as sketched below). However, for smaller tables (as is often the case with time-series data), that incurs needless storage and compute costs.
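For illustration, a minimal sketch of that workaround, assuming the filesystem destination with `table_format="delta"` honors the `partition` column hint; the resource name, pipeline name, and sample rows are made up:

```python
import dlt
import pendulum


@dlt.resource(
    table_format="delta",
    write_disposition="append",
    columns={
        "year": {"partition": True},
        "month": {"partition": True},
        "day": {"partition": True},
    },
)
def sensor_readings():
    # illustrative rows; a real resource would pull these from a source system
    rows = [
        {"sensor_id": 1, "value": 0.42, "ts": pendulum.datetime(2024, 11, 13, 10, 0)},
        {"sensor_id": 1, "value": 0.57, "ts": pendulum.datetime(2024, 11, 14, 10, 0)},
    ]
    for row in rows:
        # derive the partition columns from the timestamp inside the resource itself
        yield {**row, "year": row["ts"].year, "month": row["ts"].month, "day": row["ts"].day}


pipeline = dlt.pipeline("sensor_pipeline", destination="filesystem")
pipeline.run(sensor_readings())
```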

Compaction would also be nice to have: near-real-time tables tend to receive very frequent writes, each producing a small file, which quickly makes the data hard to query when reading the Delta Lake directly with something like Polars or DuckDB.

Are you a dlt user?

None

Use case

We're precisely trying to find a way to partition our data without appending additional fields, as described above.

Proposed solution

I am unsure what the syntax for this would look like, but considering how common this use case is, datetime-specific partitioning (e.g. by year, month, and day of a timestamp column) could probably be integrated.

Related issues

No response

@rudolfix rudolfix self-assigned this Nov 18, 2024
@rudolfix rudolfix added the question Further information is requested label Nov 18, 2024
@rudolfix
Collaborator

@arjun-panchmatia-mechademy Regarding compacting and vacuuming: this is already supported:
https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#get_delta_tables-helper
In essence: after a pipeline run, you can use the pipeline to get all the Delta tables and then vacuum them or rebuild indexes. This is just a few lines of code (see the sketch below). We decided it is better if that happens outside the load process.
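For reference, a minimal sketch following the linked `get_delta_tables` docs, assuming `pipeline` is the pipeline object that just ran:

```python
from dlt.common.libs.deltalake import get_delta_tables

# after pipeline.run(...), get a mapping of table name -> deltalake.DeltaTable
delta_tables = get_delta_tables(pipeline)

for name, table in delta_tables.items():
    table.optimize.compact()  # merge many small files into fewer, larger ones
    table.vacuum()            # remove files no longer referenced by the table
```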

Metadata is compacted by delta-rs every (AFAIK) 100 runs of the pipeline.

Now regarding partitions: we need to figure out how to add generated columns with delta-rs. Then we can use a resource adapter to generate custom hints. The best way to understand what an adapter does is to look at the BigQuery adapter:
https://dlthub.com/docs/dlt-ecosystem/destinations/bigquery#use-an-adapter-to-apply-hints-to-a-resource
In the case of Delta, we could use an adapter to specify generated columns, partitions (and maybe more); see the sketch after this comment.
@jorritsandbrink FYI. we'll take a look at delta-rs soon.
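For context, the BigQuery adapter pattern referenced above looks roughly like this (a sketch based on the linked docs with made-up data; a Delta-specific adapter for partitions or generated columns does not exist yet and is only hypothesized here):

```python
from datetime import datetime, timezone

import dlt
from dlt.destinations.adapters import bigquery_adapter


@dlt.resource
def events():
    yield {"event_ts": datetime(2024, 11, 13, 10, 0, tzinfo=timezone.utc), "value": 1.0}


# the adapter wraps the resource and attaches destination-specific hints
# (here: partition the BigQuery table by event_ts); a future delta adapter
# could follow the same shape to declare partitions or generated columns
bigquery_adapter(events, partition="event_ts")
```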

@jorritsandbrink
Collaborator

@rudolfix I think delta-rs currently doesn't support writing generated columns, at least not in the Python binding: delta-io/delta-rs#2210 (reply in thread).

@rudolfix rudolfix moved this from In Progress to Planned in dlt core library Dec 19, 2024