Extended filesystem layout placeholders for destinations #930

Closed
sultaniman opened this issue Feb 2, 2024 · 1 comment
Assignees
Labels
destination Issue related to new destinations enhancement New feature or request support This issue is monitored by Solution Engineer

Comments

@sultaniman
Contributor

sultaniman commented Feb 2, 2024

Problem statement

There are already a few requests to extend the placeholders supported by the layout option, which has the following default value:

[destination.filesystem]
layout="{table_name}/{load_id}.{file_id}.{ext}"
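As a rough illustration of what the placeholders stand for, the default layout can be expanded with plain Python string formatting. This is only a sketch; the values below are made up and dlt's own path building may differ in detail.

```python
# Default layout from the config above.
layout = "{table_name}/{load_id}.{file_id}.{ext}"

# Hypothetical values, chosen only to show the resulting path shape.
path = layout.format(
    table_name="events",
    load_id="1706870400.123",
    file_id="a1b2c3",
    ext="jsonl",
)
print(path)
```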

A couple of these requests came in on GitHub (1, 2) and others on Slack. Users clearly want finer-grained control over layout formatting through additional placeholders.

Request #1

curr_year, curr_month, curr_day, curr_hour and curr_minute

Request #2

Partition config should support resource parameters, for example, if date (or year, month, date) is a parameter to a resource, that should be available in the partition layout.

Partition config should support resource metadata, such as object path or creation date, for example, in order to replicate the partitioning from the source to destination.

Request #3

We would also like to be able to partition by the date on which the job is running, very similar to Airflow's ds field.

That way we can have hive-style partitions: s3:///..../year={job_run.year}/month={job_run.month}/day={job_run.day}/table/schema/...

It would also be nice to be able to pass these kinds of fields as parameters to the dlt operator on Airflow.
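To make the hive-style request concrete, here is a minimal sketch using Python's `str.format` attribute access. The `job_run` name comes from the request above; binding it to a `datetime.date` and zero-padding month and day are assumptions for illustration, not an existing dlt feature.

```python
from datetime import date

# Partition suffix only; the bucket prefix is left out on purpose.
layout = "year={job_run.year}/month={job_run.month:02d}/day={job_run.day:02d}"

# Assumed: the run date is available as a datetime.date object.
job_run = date(2024, 2, 2)

partition = layout.format(job_run=job_run)
print(partition)
```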

TODO

We need to augment LoadFilesystemJob.make_destination_filename and expand the available parameters with datetime fields for year, month, day, hour, minute, second, and microsecond.

Extending [SUPPORTED_PLACEHOLDERS](https://github.com/dlt-hub/dlt/blob/devel/dlt/destinations/path_utils.py#L10) with a known set of variables should cover the minimum use cases. It currently has the following values:

SUPPORTED_PLACEHOLDERS = {"schema_name", "table_name", "load_id", "file_id", "ext", "curr_date"}

After the extension it will look like this:

SUPPORTED_PLACEHOLDERS = {"schema_name", "table_name", "load_id", "file_id", "ext", "curr_date", "year", "month", "day", "hour", "minute", "second", "microsecond"}
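A minimal sketch of how the extended set might be used: validate a layout against the supported names, then fill in the datetime fields from a timestamp. The helper names `datetime_placeholders`, `check_layout`, and `make_path` are hypothetical and not part of dlt, and the zero-padding of the values is an assumption.

```python
from datetime import datetime, timezone
from string import Formatter

SUPPORTED_PLACEHOLDERS = {
    "schema_name", "table_name", "load_id", "file_id", "ext", "curr_date",
    "year", "month", "day", "hour", "minute", "second", "microsecond",
}


def datetime_placeholders(now: datetime) -> dict:
    """Hypothetical helper: derive the new datetime fields (padding assumed)."""
    return {
        "year": f"{now.year:04d}",
        "month": f"{now.month:02d}",
        "day": f"{now.day:02d}",
        "hour": f"{now.hour:02d}",
        "minute": f"{now.minute:02d}",
        "second": f"{now.second:02d}",
        "microsecond": f"{now.microsecond:06d}",
    }


def check_layout(layout: str) -> set:
    """Return the placeholders used in layout, rejecting unsupported ones."""
    used = {name for _, name, _, _ in Formatter().parse(layout) if name}
    unknown = used - SUPPORTED_PLACEHOLDERS
    if unknown:
        raise ValueError(f"Unsupported placeholders: {sorted(unknown)}")
    return used


def make_path(layout: str, now: datetime, **params: str) -> str:
    """Format layout; existing placeholders such as table_name come via params."""
    check_layout(layout)
    return layout.format(**params, **datetime_placeholders(now))


example = make_path(
    "{table_name}/{year}/{month}/{day}/{load_id}.{file_id}.{ext}",
    datetime(2024, 2, 2, 13, 5, tzinfo=timezone.utc),
    table_name="events", load_id="1", file_id="f", ext="jsonl",
)
print(example)
```

Validating before formatting gives a clear error for a mistyped placeholder instead of a bare `KeyError` from `str.format`.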

@rudolfix
Collaborator

@Pipboyguy this and #759 are related. This is our best scope description: #555 (comment)

My take here

  1. We should let users create path names (and thus partitions) from any kind of metadata they want.
  2. Possibly the best way would be for the filesystem destination to accept a callback that creates a file name on top of the existing format specifier.
  3. It could also allow people to define custom formatters, binding a placeholder to a value or to a function that is resolved at run time.
  4. Mind that a certain order of placeholders is required for the replace write disposition in the filesystem destination.
  5. @Pipboyguy you are free to merge the tickets and create a scope + ping users that were involved there
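Points 2 and 3 could look roughly like the sketch below, where extra placeholders are bound either to a fixed value or to a function resolved at run time. Everything here is hypothetical: the `render` function, its signature, and the choice to pass the table name to the callback are assumptions, not an agreed design.

```python
from typing import Callable, Dict, Union

# A custom placeholder is either a static string or a callable
# that receives the table name and returns a string (assumed contract).
PlaceholderValue = Union[str, Callable[[str], str]]


def render(
    layout: str,
    base: Dict[str, str],
    extra: Dict[str, PlaceholderValue],
) -> str:
    """Resolve custom placeholders, then format the layout."""
    resolved = dict(base)
    for name, value in extra.items():
        # Callables are resolved at run time; plain values are used as-is.
        resolved[name] = value(base["table_name"]) if callable(value) else value
    return layout.format(**resolved)


base = {"table_name": "events", "load_id": "1", "file_id": "f", "ext": "jsonl"}
extra = {"env": "prod", "owner": lambda table_name: table_name.upper()}

custom_path = render(
    "{env}/{owner}/{table_name}/{load_id}.{file_id}.{ext}", base, extra
)
print(custom_path)
```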

@Pipboyguy Pipboyguy added destination Issue related to new destinations enhancement New feature or request community This issue came from slack community workspace labels Feb 26, 2024
Pipboyguy added a commit that referenced this issue Feb 26, 2024
Pipboyguy added a commit that referenced this issue Mar 3, 2024
Signed-off-by: Marcel Coetzee <[email protected]>
@rudolfix rudolfix assigned sh-rp and unassigned Pipboyguy Mar 11, 2024
@rudolfix rudolfix moved this from In Progress to Planned in dlt core library Mar 18, 2024
@sultaniman sultaniman moved this from Planned to In Progress in dlt core library Apr 2, 2024
@rudolfix rudolfix added support This issue is monitored by Solution Engineer and removed community This issue came from slack community workspace labels Apr 2, 2024
@sultaniman sultaniman moved this from In Progress to Done in dlt core library Apr 16, 2024
Projects
Status: Done
4 participants