[Design] Delta and Unity catalog exporters implementation #6943
Conversation
I liked this design a lot, thank you for doing it.
Adding questions inline
Following the [catalog exports issue](https://github.com/treeverse/lakeFS/issues/6461), the Delta Lake log will be exported to the `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/` path, which resides within the user's designated storage bucket.
Within the `_delta_log` directory, you will find the following components:
1. The last checkpoint (or the initial log entry if no checkpoint has been established yet).
2. All the log entries that have been recorded since that last checkpoint.
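For illustration, the exported layout for one table at one commit might look roughly like this (the branch, commit ID, table name, and log file names are hypothetical; the file names follow the standard Delta Lake log naming scheme):

```
${storageNamespace}/_lakefs/exported/main/<commit-id>/my-table/_delta_log/
├── 00000000000000000010.checkpoint.parquet   <- last checkpoint
├── 00000000000000000011.json                 <- log entries since that checkpoint
└── 00000000000000000012.json
```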
Observation: With this structure, the number of exported files is not tightly coupled with the number of changes to the table. Assuming N is the number of log entries between checkpoints, adding a single file to the table can result in N + 1 + 1 objects to export. Assuming N=10, we should be okay (but worth mentioning in the docs).
I think that it would be N + 1 objects at most (the checkpoint + N JSON entries).
`${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}`. Following this, it will create an external table in an existing `catalog.schema` within the Unity catalog, using the Databricks API, the `_lakefs_tables/<table>.yaml` definitions provided by the user, and specifying the location where the Delta log was exported to.
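For a rough sense of that step, here is a minimal Go sketch that registers such an external table by running a `CREATE TABLE` statement through the Databricks Go SDK's statement-execution service. The warehouse ID, catalog/schema/table names, and export location are placeholders, and the exact SDK wiring is my assumption, not part of the design itself.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/databricks/databricks-sdk-go"
	"github.com/databricks/databricks-sdk-go/service/sql"
)

func main() {
	ctx := context.Background()

	// Host and service-principal token are picked up by the SDK's default
	// configuration resolution (environment variables / config profile).
	w, err := databricks.NewWorkspaceClient()
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder values; in the exporter these would come from the hook's
	// configuration and from the Delta log export step.
	catalogName := "my-catalog"
	schemaName := "main" // one schema per branch
	tableName := "my-table"
	location := "s3://bucket/namespace/_lakefs/exported/main/<commit-id>/my-table"

	// Register the exported Delta log as an external table. A real exporter
	// would also need to handle replacing an existing table.
	stmt := fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS %s.%s.%s USING DELTA LOCATION '%s'",
		catalogName, schemaName, tableName, location,
	)

	_, err = w.StatementExecution.ExecuteStatement(ctx, sql.ExecuteStatementRequest{
		WarehouseId: "<sql-warehouse-id>",
		Statement:   stmt,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```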
How many external tables can Unity realistically support? The documented limit for Delta Sharing was 1k, although we managed to get to 100k before we had issues. This means that 100k isn't a good answer either. Should we also consider deregistering tables?
The limitations are:
- 10,000 schemas per catalog
- 10,000 tables per schema
That means we can support 10,000 tables per branch.
Was the problem the total number of tables repo-wise? If so, this solution allows a total of 100,000,000 tables per catalog (repo), i.e. 10,000 schemas with 10,000 tables each.
2. Utilizing the table names configured for this hook, such as `['my-table', 'my-other-table']`, establish or replace external tables within the Unity catalog and schema (both of which are provided in the hook's configuration). Ensure that you use the field names and data types as specified in the `_lakefs_tables/my-table.yaml` and `_lakefs_tables/my-other-table.yaml` files.
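For illustration only, a `_lakefs_tables/my-table.yaml` definition carrying field names and types could look like the following; the exact spec is part of this design's implementation, so treat the fields here as an assumption modeled on the existing table-spec format:

```yaml
# _lakefs_tables/my-table.yaml (illustrative sketch, not a final spec)
name: my-table
type: delta
path: tables/my-table   # table location within the repository
schema:
  type: struct
  fields:
    - name: id
      type: long
      nullable: false
    - name: name
      type: string
      nullable: true
```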
Not sure I understand the table name concept. Will running the hook over `dev` replace the `main` table? Shouldn't they have a different schema or table name?
You are right: the schema itself will represent the branch, so if the schema doesn't exist, it will be created.
Adding.
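A minimal sketch of that create-the-schema-for-the-branch step, assuming the Databricks Go SDK's Unity Catalog schemas API is used (the already-exists check and names are illustrative assumptions):

```go
package unityexport

import (
	"context"
	"strings"

	"github.com/databricks/databricks-sdk-go"
	"github.com/databricks/databricks-sdk-go/service/catalog"
)

// ensureBranchSchema creates a schema named after the branch inside the
// configured catalog, treating "already exists" as success.
func ensureBranchSchema(ctx context.Context, w *databricks.WorkspaceClient, catalogName, branch string) error {
	_, err := w.Schemas.Create(ctx, catalog.CreateSchema{
		Name:        branch,
		CatalogName: catalogName,
	})
	if err != nil && !strings.Contains(err.Error(), "already exists") {
		return err
	}
	return nil
}
```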
- Authentication with Databricks will require a [service principal](https://docs.databricks.com/en/dev-tools/service-principals.html) and an associated token to be provided in the hook's configuration.
What's the token lifetime? It will be passed via an env var, which means lakeFS would at least have to restart. If the token lifetime is short, it harms the usability of this feature.
You can configure the token lifetime when creating it. Yes, that means that when the token expires it should be replaced. We can start by supplying the token in the hook's configuration; that way lakeFS wouldn't need to be restarted.
What would token rotation look like if it's in the hook's configuration? Editing that file across all branches? Not sure it's the best option either.
- The users will supply existing catalog and schema under which the table will be created using the [Databricks Go SDK](https://docs.databricks.com/en/dev-tools/sdk-go.html).
I don't get it. They will use the SDK to supply the schema and table? Isn't this a Lua hook?
The users will supply only the catalog.
I meant that eventually the hook will use the Databricks SDK to create the table and schema.
Fixing.
Looking good!
This is a minimal design that reflects my intentions for the implementation of both the Delta Lake and Unity exporters.
It is based on the catalog exports issue.