[Design] Delta and Unity catalog exporters implementation (#6943)
* Delta and Unity catalog exporters implementation design

* minor changes
Jonathan-Rosenberg authored Nov 9, 2023
1 parent ff38131 commit 734405b
Showing 2 changed files with 114 additions and 0 deletions.
83 changes: 83 additions & 0 deletions design/open/delta-catalog-exporter.md
@@ -0,0 +1,83 @@
# Delta Lake catalog exporter

## Introduction

The Delta Lake table format manages its catalog of used files through a log-based system. This log, as the name implies,
contains a sequence of deltas representing changes applied to the table. The list of files that collectively represent
the table at a specific log entry is constructed by reapplying the changes stored in the log files one by one, starting
from the last checkpoint (a file that summarizes all changes up to that point) and progressing to the latest log entry.
Each log entry contains `add` or `remove` actions, which add or remove data (Parquet) files and ultimately shape the
structure of the table.
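
As a minimal illustration of this replay (not part of the proposal), the sketch below reconstructs a table's file set from a checkpoint and the subsequent log entries, with the log-entry structs trimmed down to the single field needed here; real entries carry many more fields (stats, partition values, timestamps, and so on):

```go
package deltaexport

// fileAction is a trimmed-down view of an "add" or "remove" action.
type fileAction struct {
	Path string `json:"path"`
}

// logEntry is a trimmed-down view of a single Delta log entry.
type logEntry struct {
	Add    *fileAction `json:"add,omitempty"`
	Remove *fileAction `json:"remove,omitempty"`
	// commitInfo, metaData, protocol, ... omitted for brevity
}

// replay applies log entries on top of the checkpoint's file set and returns
// the data files that make up the table after the last entry.
func replay(checkpointFiles []string, entries []logEntry) map[string]struct{} {
	files := make(map[string]struct{}, len(checkpointFiles))
	for _, p := range checkpointFiles {
		files[p] = struct{}{}
	}
	for _, e := range entries {
		switch {
		case e.Add != nil:
			files[e.Add.Path] = struct{}{}
		case e.Remove != nil:
			delete(files, e.Remove.Path)
		}
	}
	return files
}
```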

In order to make Delta Lake tables accessible to external users, we aim to export the Delta Lake log to an external
location, enabling these users to read tables backed by lakeFS and Delta Lake.

---

## Proposed Solution

Following the [catalog exports issue](https://github.com/treeverse/lakeFS/issues/6461), the Delta Lake log will be
exported to the `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/` path, which resides
within the user's designated storage bucket.
Within the `_delta_log` directory, you will find the following components:
1. The last checkpoint (or the initial log entry if no checkpoint has been established yet).
2. All the log entries that have been recorded since that last checkpoint.
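
For illustration, an exported `_delta_log` directory might contain files such as the following (the version numbers are placeholders; the names follow Delta's zero-padded file-naming convention):

```
${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/
├── 00000000000000000010.checkpoint.parquet   <- last checkpoint
├── 00000000000000000011.json                 <- log entries recorded since
└── 00000000000000000012.json                    that checkpoint
```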

Notably, the exported log entries will mirror those present in lakeFS, with one key distinction: instead of using
relative logical paths, they will contain the absolute physical paths of the underlying objects:

#### lakeFS-backed Delta Log entry:
```json
{ "commitInfo": {
    "timestamp": 1699199369960,
    "operation": "WRITE",
    "operationParameters": {
      "mode": "Overwrite",
      "partitionBy": "[]"
    },
    "readVersion": 2,
    ...
  }
}
{ "add": {
    "path": "part-00000-72b765fd-a97b-4386-b92c-cc582a7ca176-c000.snappy.parquet",
    ...
  }
}
{ "remove": {
    "path": "part-00000-56e72a31-0078-459d-a577-ef2c5d3dc0f9-c000.snappy.parquet",
    ...
  }
}
```

#### Exported Delta Log entry:
```json
{ "commitInfo": {
    "timestamp": 1699199369960,
    "operation": "WRITE",
    "operationParameters": {
      "mode": "Overwrite",
      "partitionBy": "[]"
    },
    "readVersion": 2,
    ...
  }
}
{ "add": {
    "path": "s3://my-bucket/my-path/data/gk3l4p7nl532qibsgkv0/cl3rj1fnl532qibsglr0",
    ...
  }
}
{ "remove": {
    "path": "s3://my-bucket/my-path/data/gk899p7jl532qibsgkv8/zxcrhuvnl532qibshouy",
    ...
  }
}
```

---

We shall use the [delta-go](https://github.com/csimplestring/delta-go) package to read the Delta Lake log from the last
checkpoint (or from the first entry, if no checkpoint exists) and generate the new `_delta_log` directory with log
entries as described above. The directory and log files will be written to
`${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/`.
The `tableName` will be fetched from the hook's configuration.
The Delta Lake table can then be read from `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}`.
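
As a rough illustration of the path-rewriting step described above, the sketch below rewrites `add`/`remove` paths in a single log file. It is not the proposed implementation: the exporter is expected to go through delta-go rather than raw JSON handling, and the `resolve` callback stands in for a hypothetical lakeFS lookup of an object's physical address.

```go
package deltaexport

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
)

// rewriteLogEntries reads a Delta log file (one JSON action per line) from r,
// replaces relative "path" values in add/remove actions using resolve, and
// writes the rewritten entries to w. Entries without add/remove actions
// (e.g. commitInfo) are passed through unchanged.
func rewriteLogEntries(r io.Reader, w io.Writer, resolve func(logical string) (string, error)) error {
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		var entry map[string]json.RawMessage
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			return fmt.Errorf("parse log entry: %w", err)
		}
		for _, action := range []string{"add", "remove"} {
			raw, ok := entry[action]
			if !ok {
				continue
			}
			var fields map[string]interface{}
			if err := json.Unmarshal(raw, &fields); err != nil {
				return fmt.Errorf("parse %s action: %w", action, err)
			}
			logical, _ := fields["path"].(string)
			physical, err := resolve(logical) // e.g. s3://my-bucket/my-path/data/...
			if err != nil {
				return err
			}
			fields["path"] = physical
			rewritten, err := json.Marshal(fields)
			if err != nil {
				return err
			}
			entry[action] = json.RawMessage(rewritten)
		}
		line, err := json.Marshal(entry)
		if err != nil {
			return err
		}
		if _, err := fmt.Fprintln(w, string(line)); err != nil {
			return err
		}
	}
	return scanner.Err()
}
```
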
31 changes: 31 additions & 0 deletions design/open/unity-catalog-exporter.md
@@ -0,0 +1,31 @@
# Unity catalog exporter

## Introduction

Currently, the Databricks Unity catalog supports only direct cloud provider storage endpoints and authentication, so it
is not feasible to configure it to work directly with lakeFS.
We wish to overcome this limitation and enable Unity catalog-backed services to read Delta Lake tables, the default
table format used by Databricks, from lakeFS.

---

## Proposed Solution

Following the [catalog exports issue](https://github.com/treeverse/lakeFS/issues/6461), the Unity catalog exporter will
utilize the [Delta Lake catalog exporter](./delta-catalog-exporter.md) to export an existing Delta Lake table to
`${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}`. It will then create an external table in an
existing `catalog.schema` within the Unity catalog, using the Databricks API, the user-provided
`_lakefs_tables/<table>.yaml` definitions, and the location to which the Delta log was exported.
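
For context, a `_lakefs_tables/<table>.yaml` definition might look roughly like the following. This is a hypothetical illustration only; the actual schema of these files is defined as part of the catalog exports proposal, and the field names shown here are assumptions:

```yaml
# Hypothetical table definition, for illustration only.
name: my-table
type: delta
path: tables/my-table        # table root within the repository
schema:
  fields:
    - name: id
      type: int
    - name: name
      type: string
```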

### Flow

1. Execute the Delta Lake catalog exporter procedure and retrieve the path to the exported data.
2. For each table name configured for this hook (e.g. `['my-table', 'my-other-table']`), create or replace an external
table within the Unity catalog provided in the hook's configuration, under a schema named after the branch, using the
field names and data types specified in the corresponding `_lakefs_tables/my-table.yaml` and `_lakefs_tables/my-other-table.yaml` files.

Once the hook's run has completed successfully, the tables can be read from the Databricks Unity catalog-backed service.

- Authentication with Databricks will require a [service principal](https://docs.databricks.com/en/dev-tools/service-principals.html)
and an associated token to be provided in the hook's configuration.
- Users will supply an existing catalog under which the schema and table will be created using the [Databricks Go SDK](https://docs.databricks.com/en/dev-tools/sdk-go.html).
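
The design leaves the exact Databricks call to the implementation phase (the Go SDK is the proposed route). As a rough, non-authoritative illustration of the table-registration step, the sketch below submits an equivalent `CREATE TABLE ... USING DELTA LOCATION` statement through Databricks' SQL Statement Execution REST API; the endpoint path, payload shape, and the need for a SQL warehouse ID are assumptions of this sketch, not requirements of the design.

```go
package unityexport

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// Column mirrors a field taken from the user's _lakefs_tables/<table>.yaml definition.
type Column struct {
	Name string
	Type string // e.g. "INT", "STRING"
}

// createExternalTable registers an external Delta table in an existing
// catalog.schema, pointing at the location the Delta log was exported to.
func createExternalTable(ctx context.Context, host, token, warehouseID, catalog, schema, table, location string, cols []Column) error {
	// Build the DDL from the table definition and the exported location.
	ddl := fmt.Sprintf("CREATE TABLE IF NOT EXISTS `%s`.`%s`.`%s` (", catalog, schema, table)
	for i, c := range cols {
		if i > 0 {
			ddl += ", "
		}
		ddl += fmt.Sprintf("`%s` %s", c.Name, c.Type)
	}
	ddl += fmt.Sprintf(") USING DELTA LOCATION '%s'", location)

	body, err := json.Marshal(map[string]string{
		"statement":    ddl,
		"warehouse_id": warehouseID, // assumption: a SQL warehouse runs the DDL
		"wait_timeout": "30s",
	})
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, host+"/api/2.0/sql/statements/", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token) // token issued for the service principal
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("create table failed: %s", resp.Status)
	}
	return nil
}
```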
