Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta Lake Catalog Exporter #7078

Merged
merged 38 commits into from
Dec 11, 2023
Merged

Conversation

Jonathan-Rosenberg
Copy link
Contributor

Closes #6965

Changes

  1. Change the encoding/json lua library to accept an optional options table. This is in order to pass an indentation if needed.
  2. Implement a delta client for lua which can get a delta lake table.
  3. Pass listeningAddress variable to the actions service. This is to allow hooks to communicate back to the running lakeFS server if they run on the same host, e.g. lua hooks.
  4. Implement the delta_exporter lua function that allows configuring a delta table exporting hook.
  5. Extend the lakeFS lua client with stat_object and get_repo operations.

NOTICE 1

This PR depends on the merging of treeverse/delta-go#2.

NOTICE 2

Documentation will be added in a different PR.

How was this tested?

Manually tested on the following Delta Lake log constructs:

  1. A delta log with no checkpoints (i.e. starts with entry 000000000000000000.json).
  2. A delta log with a checkpoint and past logs (e.g. 000000000000000004.json and all other json entries up to and including checkpoint 00000000000000000010.checkpoint.parquet, and entry 00000000000000000010.json)
  3. A delta log with a checkpoint and no past logs (e.g. starting from 00000000000000000010.checkpoint.parquet, followed by a number of log entries).

@Jonathan-Rosenberg Jonathan-Rosenberg added the include-changelog PR description should be included in next release changelog label Nov 29, 2023
@Jonathan-Rosenberg Jonathan-Rosenberg requested review from nopcoder, ozkatz, itaiad200 and Isan-Rivkin and removed request for ozkatz November 29, 2023 15:19
@Jonathan-Rosenberg Jonathan-Rosenberg requested review from johnnyaug and removed request for nopcoder November 29, 2023 16:34
examples/hooks/delta_lake_S3_export.lua Show resolved Hide resolved
pkg/actions/lua/formats/delta.go Outdated Show resolved Hide resolved
pkg/actions/lua/formats/delta.go Outdated Show resolved Hide resolved
if ns == nil then
error("failed getting storage namespace for repo " .. repo)
end
local delta = formats.delta_client("s3", lakefs_key, lakefs_secret, region)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully in the context yet but I can't help not wonder already.
Is there a reason to depend on passing lakeFS credentials here?

Until now we worked with the policy that lua lakeFS client is using the context of the logged in user in the context during runtime to authenticate with lakeFS api.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned F2F, this is due to the need to supply the delta-go package's client AWS key and secret to communicate with the S3 gateway.
On top of that, the administrator might configure a user that has permission to read the Delta table from its location such that every commit will export it even if the user that committed it doesn't have the right permissions.

Copy link
Contributor

@Isan-Rivkin Isan-Rivkin Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% 👍
Should document (in general I know not in this PR, what lakeFS permissions are required)
The technical problem reusing same technique as lakeFS lua client is because of the creds provider that expects credentials. 😢

CredsProvider: aws.CredentialsProviderFunc(func(ctx context.Context) (aws.Credentials, error) {

pkg/actions/kv_run_results_iterator_test.go Show resolved Hide resolved
pkg/actions/service.go Outdated Show resolved Hide resolved
pkg/actions/lua/stdlib.go Outdated Show resolved Hide resolved
Comment on lines 69 to 82
local function sortedKeys(query, sortFunction)
local keys, len = {}, 0
for k,_ in pairs(query) do
len = len + 1
keys[len] = k
end
if sortFunction ~= nil then
table.sort(keys, sortFunction)
else
table.sort(keys)
end

return keys
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the sort function needed here, don't see its being used? If it's not needed let's remove it.
If you prefer to keep it then no need to check for nil:

Suggested change
local function sortedKeys(query, sortFunction)
local keys, len = {}, 0
for k,_ in pairs(query) do
len = len + 1
keys[len] = k
end
if sortFunction ~= nil then
table.sort(keys, sortFunction)
else
table.sort(keys)
end
return keys
end
local function sortedKeys(query, sortFunction)
local sort = sortFunction or
local keys, len = {}, 0
for k,_ in pairs(query) do
len = len + 1
keys[len] = k
end
table.sort(keys, sortFunction)
return keys
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in delta_exporter.
Not sure I understand the suggested change, but I do agree that the nil check is redundant. 🙏

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant specifically the sortFunction argument in the function, I don't see it being used anywhere?
If you prefer to leave it regardless, please add a comment on the sortFunction expected signature otherwise it'll be hard to use it without going into the implementation of sortedKeys

Copy link
Contributor Author

@Jonathan-Rosenberg Jonathan-Rosenberg Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sortFunction is passed to table.sort to apply the sorting logic (otherwise it uses default lexicographic sort). I don't need to pass, but since it's a util function, the API should suggest a way to sort keys.
I'll add a comment.

pkg/actions/lua/lakefs/catalogexport/delta_exporter.lua Outdated Show resolved Hide resolved
pkg/actions/lua/lakefs/catalogexport/delta_exporter.lua Outdated Show resolved Hide resolved
pkg/actions/lua/lakefs/catalogexport/storage_utils.lua Outdated Show resolved Hide resolved
Copy link
Contributor

@Isan-Rivkin Isan-Rivkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review WIP - if you prefer all at once LMK 🙏

pkg/actions/lua/lakefs/catalogexport/internal.lua Outdated Show resolved Hide resolved
pkg/actions/lua/lakefs/catalogexport/delta_exporter.lua Outdated Show resolved Hide resolved
pkg/actions/lua/lakefs/catalogexport/delta_exporter.lua Outdated Show resolved Hide resolved
pkg/actions/lua/lakefs/client.go Outdated Show resolved Hide resolved
Copy link
Contributor

@Isan-Rivkin Isan-Rivkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General review comment - what is the plan about tests?

  1. Reference to glue and symlink exporter test integration in esti - can we add a test here?

  2. Lua package unit testing - see reference hive_partition_pager.lua test

pkg/actions/lua/lakefs/catalogexport/delta_exporter.lua Outdated Show resolved Hide resolved
pkg/actions/lua/formats/delta.go Outdated Show resolved Hide resolved
pkg/actions/lua/formats/delta.go Show resolved Hide resolved
pkg/actions/lua/formats/delta.go Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Dec 3, 2023

♻️ PR Preview 2b9f8a2 has been successfully destroyed since this PR has been closed.

🤖 By surge-preview

Copy link
Contributor Author

@Jonathan-Rosenberg Jonathan-Rosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Isan-Rivkin, tests will be added on a different PR.
Thank you 🙏

pkg/actions/lua/formats/delta.go Show resolved Hide resolved
pkg/actions/lua/stdlib.go Outdated Show resolved Hide resolved
pkg/actions/lua/lakefs/catalogexport/delta_exporter.lua Outdated Show resolved Hide resolved
pkg/actions/lua/formats/delta.go Outdated Show resolved Hide resolved
Copy link
Contributor

@Isan-Rivkin Isan-Rivkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
I left 2 comments un-resolved as reminders incase it's missed - feel free resolve.
Thank you for taking the time and addressing the comments!
Can't wait to export to Delta 🕺

@Jonathan-Rosenberg Jonathan-Rosenberg merged commit bb38ba0 into master Dec 11, 2023
35 checks passed
@Jonathan-Rosenberg Jonathan-Rosenberg deleted the feature/catalog-export/delta branch December 11, 2023 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delta Catalog exporter
2 participants