[Design] Delta and Unity catalog exporters implementation #6943

Merged
merged 2 commits into master from design/impl/delta-unity-catalog-exporter on Nov 9, 2023

Conversation

@Jonathan-Rosenberg (Contributor)

This is a minimal design that reflects my intentions for the implementation of both the Delta Lake and Unity exporters.

This is based on the catalog exports issue.

@Jonathan-Rosenberg added the exclude-changelog (PR description should not be included in next release changelog) and minor-change (used for PRs that don't require an attached issue) labels on Nov 6, 2023
@Jonathan-Rosenberg Jonathan-Rosenberg requested review from ozkatz and a team November 6, 2023 17:11
@Jonathan-Rosenberg Jonathan-Rosenberg changed the title Delta and Unity catalog exporters implementation design [Design] Delta and Unity catalog exporters implementation Nov 6, 2023
@itaiad200 (Contributor) left a comment:

I liked this design a lot, thank you for doing it.
Adding questions inline.

design/open/delta-catalog-exporter.md (comment thread resolved)
Comment on lines +19 to +24
Following the [catalog exports issue](https://github.com/treeverse/lakeFS/issues/6461), the Delta Lake log will be
exported to the `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/` path, which resides
within the user's designated storage bucket.
Within the `_delta_log` directory, you will find the following components:
1. The last checkpoint (or the initial log entry if no checkpoint has been established yet).
2. All the log entries that have been recorded since that last checkpoint.
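To make that layout concrete, here is a minimal Go sketch of how the destination prefix could be built and which Delta log files would be copied. This is illustration only: the helper names (`exportPrefix`, `filesSinceLastCheckpoint`), the example bucket/ref/commit values, and the single-part-checkpoint assumption are mine, not the exporter's actual code.

```go
package main

import "fmt"

// exportPrefix builds the destination prefix for a table's Delta log,
// following the layout described above. The helper name is illustrative.
func exportPrefix(storageNamespace, ref, commitID, tableName string) string {
	return fmt.Sprintf("%s/_lakefs/exported/%s/%s/%s/_delta_log/",
		storageNamespace, ref, commitID, tableName)
}

// filesSinceLastCheckpoint lists the Delta log files to copy: the latest
// checkpoint (if one exists) plus every JSON commit entry recorded after it,
// or everything from the initial entry when no checkpoint exists yet.
// File names follow the Delta Lake convention of 20-digit, zero-padded
// versions; a single-part checkpoint file is assumed for simplicity.
func filesSinceLastCheckpoint(lastCheckpointVersion int64, hasCheckpoint bool, headVersion int64) []string {
	var files []string
	start := int64(0)
	if hasCheckpoint {
		files = append(files, fmt.Sprintf("%020d.checkpoint.parquet", lastCheckpointVersion))
		start = lastCheckpointVersion + 1
	}
	for v := start; v <= headVersion; v++ {
		files = append(files, fmt.Sprintf("%020d.json", v))
	}
	return files
}

func main() {
	// Hypothetical values: a checkpoint at version 10, exporting version 14.
	prefix := exportPrefix("s3://my-bucket/my-repo", "main", "abc123", "my-table")
	for _, f := range filesSinceLastCheckpoint(10, true, 14) {
		fmt.Println(prefix + f)
	}
}
```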
@itaiad200 (Contributor):

Observation: With this structure, the number of exported files is not tightly coupled with the number of changes to the table. Assuming N is the number of log entries between checkpoints, adding a single file to the table can result in N + 1 + 1 objects to export. Assuming N=10, we should be okay (but worth mentioning in the docs)

@Jonathan-Rosenberg (Contributor, author):

I think it would be N + 1 objects at most (the checkpoint + N JSON entries).
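As a concrete (illustrative) count: if the latest checkpoint sits at Delta log version 10 and the exported commit corresponds to version 14, the export copies `00000000000000000010.checkpoint.parquet` plus the four JSON entries for versions 11 through 14, i.e. N + 1 = 5 objects.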

Comment on lines +16 to +18
`${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}`. Following this, it will create an external table
in an existing `catalog.schema` within the Unity catalog, using the Databricks API and the user-provided
`_lakefs_tables/<table>.yaml` definitions, pointing the table at the location where the Delta log was exported.
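For illustration, a rough Go sketch of what that registration step could look like with the Databricks Go SDK. Going through a SQL warehouse with a `CREATE TABLE ... USING DELTA LOCATION` statement, as well as the function and parameter names, are assumptions of this sketch; the design only says "using the Databricks API".

```go
package unityexport

import (
	"context"
	"fmt"

	"github.com/databricks/databricks-sdk-go"
	"github.com/databricks/databricks-sdk-go/service/sql"
)

// registerExternalTable is a hypothetical helper that registers an external
// Delta table in Unity Catalog, pointing at the location the Delta log was
// exported to. It issues a SQL statement through an existing SQL warehouse;
// replacing an already-registered table may additionally require dropping it
// first, which is omitted here.
func registerExternalTable(ctx context.Context, host, token, warehouseID, catalogName, schemaName, tableName, location string) error {
	w, err := databricks.NewWorkspaceClient(&databricks.Config{Host: host, Token: token})
	if err != nil {
		return err
	}
	stmt := fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS `%s`.`%s`.`%s` USING DELTA LOCATION '%s'",
		catalogName, schemaName, tableName, location,
	)
	_, err = w.StatementExecution.ExecuteStatement(ctx, sql.ExecuteStatementRequest{
		WarehouseId: warehouseID,
		Statement:   stmt,
	})
	return err
}
```

An alternative would be the Unity Catalog Tables API directly; the statement-execution route is shown only because it keeps the sketch to a single call.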
@itaiad200 (Contributor):

How many external tables can Unity realistically support? The documented limit for Delta Sharing was 1k, although we managed to get to 100k before we hit issues. This means that 100k isn't a good answer either. Should we also consider deregistering tables?

@Jonathan-Rosenberg (Contributor, author) commented on Nov 8, 2023:

The limitations are:

- 10,000 schemas per catalog
- 10,000 tables per schema

That means we can support 10,000 tables per branch.
Was the problem the total number of tables repo-wise? If so, this solution allows a total of 100,000,000 tables per catalog (repo).

Comment on lines 23 to 25
2. Utilizing the table names configured for this hook, such as `['my-table', 'my-other-table']`, establish or replace external
tables within the Unity catalog and schema (both of which are provided in the hook's configuration). Ensure that you use
the field names and data types as specified in the `_lakefs_tables/my-table.yaml` and `_lakefs_tables/my-other-table.yaml` files.
@itaiad200 (Contributor):

Not sure I understand the table name concept. Will running the hook over `dev` replace the `main` table? Shouldn't they have a different schema or table name?

@Jonathan-Rosenberg (Contributor, author) commented on Nov 8, 2023:

You are right; the schema itself will represent the branch, so if the schema doesn't exist, it will be created.
Adding.
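A minimal sketch of that create-if-missing behavior, assuming the branch name maps one-to-one to a Unity Catalog schema under the user-supplied catalog and using the Databricks Go SDK's Schemas API; `ensureBranchSchema` and the flow are illustrative assumptions, not the design's final shape.

```go
package unityexport

import (
	"context"

	"github.com/databricks/databricks-sdk-go"
	"github.com/databricks/databricks-sdk-go/service/catalog"
)

// ensureBranchSchema is a hypothetical helper: the schema name represents the
// lakeFS branch and is created under the user-supplied catalog before any
// tables are registered. A real implementation would first check whether the
// schema already exists (or tolerate an "already exists" error) rather than
// creating it unconditionally.
func ensureBranchSchema(ctx context.Context, w *databricks.WorkspaceClient, catalogName, branch string) error {
	_, err := w.Schemas.Create(ctx, catalog.CreateSchema{
		Name:        branch,
		CatalogName: catalogName,
	})
	return err
}
```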

Comment on lines +29 to +30
- Authentication with Databricks will require a [service principal](https://docs.databricks.com/en/dev-tools/service-principals.html)
and an associated token to be provided to the hook's configurations.
@itaiad200 (Contributor):

What's the token lifetime? It will be passed via an env var, which means lakeFS would at least have to restart. If the token lifetime is short, it harms the usability of this feature.

@Jonathan-Rosenberg (Contributor, author):

You can configure the token lifetime when creating it. Yes, that means that when the token expires it has to be replaced. We can start by supplying the token in the hook's configuration; that way lakeFS wouldn't need to be restarted.

@itaiad200 (Contributor):

What would token rotation look like if the token is in the hook's configuration? Editing that file across all branches? Not sure that's the best option either.


- Authentication with Databricks will require a [service principal](https://docs.databricks.com/en/dev-tools/service-principals.html)
and an associated token to be provided to the hook's configurations.
- The users will supply existing catalog and schema under which the table will be created using the [Databricks Go SDK](https://docs.databricks.com/en/dev-tools/sdk-go.html).
@itaiad200 (Contributor):

I don't get it. They will use the SDK to supply the schema and table? Isn't this a Lua hook?

@Jonathan-Rosenberg (Contributor, author) commented on Nov 8, 2023:

The users will supply only the catalog.
I meant that eventually the hook will use the Databricks SDK to create the table and schema.
Fixing.

@itaiad200 (Contributor) left a comment:

Looking good!

@Jonathan-Rosenberg Jonathan-Rosenberg enabled auto-merge (squash) November 9, 2023 10:15
@Jonathan-Rosenberg Jonathan-Rosenberg merged commit 734405b into master Nov 9, 2023
29 checks passed
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the design/impl/delta-unity-catalog-exporter branch November 9, 2023 10:22