Delta Exporter: Azure Support #7444

Merged
N-o-Z merged 4 commits into master from task/delta-exporter-for-azure-7296 on Feb 13, 2024

Conversation

N-o-Z
Member

@N-o-Z N-o-Z commented Feb 6, 2024

Closes #7296

Change Description

Background

Add support for delta catalog export with Azure

Testing Details

Added a new esti test for delta catalog export with AWS and Azure flavors

Breaking Change?

No

@N-o-Z N-o-Z added the include-changelog label (PR description should be included in next release changelog) Feb 6, 2024
@N-o-Z N-o-Z self-assigned this Feb 6, 2024

github-actions bot commented Feb 6, 2024

♻️ PR Preview 1032728 has been successfully destroyed since this PR has been closed.

🤖 By surge-preview


github-actions bot commented Feb 6, 2024

E2E Test Results - DynamoDB Local - Local Block Adapter

10 passed

@N-o-Z N-o-Z force-pushed the task/delta-exporter-for-azure-7296 branch 10 times, most recently from ec576ce to d2fdb99 Compare February 8, 2024 06:13
@N-o-Z N-o-Z force-pushed the task/delta-exporter-for-azure-7296 branch from d2fdb99 to f85324a Compare February 8, 2024 06:53
@N-o-Z N-o-Z marked this pull request as ready for review February 8, 2024 09:27
Contributor

@itaiad200 itaiad200 left a comment

Always great to see test coverage added, thank you! I enjoyed reading this. A few questions and requests for clarification, as I'm not very familiar with our exports.

Comment on lines 105 to 117
// ListObjects Should be implemented when needed. There are nuances between HNS and BlobStorage which requires understanding the
// Actual use case before implementing the solution
func (c *Client) ListObjects(l *lua.State) int {
	lua.Errorf(l, "Not implemented")
	panic("unreachable")
}

// DeleteRecursive Should be implemented when needed. There are nuances between HNS and BlobStorage which requires understanding the
// Actual use case before implementing the solution
func (c *Client) DeleteRecursive(l *lua.State) int {
	lua.Errorf(l, "Not implemented")
	panic("unreachable")
}
Contributor

Not sure how it works, but if this client implements an interface, wouldn't it fail without the user understanding why it was not implemented?
The user wants to export tables. Can it work without these two? Don't we delete objects when we delete a branch, for example?

Script: script,
Args: args,
collector: collector,
serverAddress: serverAddress,
Contributor

What's this?

Member Author

The lakeFS server address, used to create the lakeFS client in the Lua script

@@ -412,18 +412,18 @@ Parameters:

A package used to export Delta Lake tables from lakeFS to an external cloud storage.

-### `lakefs/catalogexport/delta_exporter.export_delta_log(action, table_names, writer, delta_client, table_descriptors_path)`
+### `lakefs/catalogexport/delta_exporter.export_delta_log(action, table_def_names, write_object, delta_client, table_descriptors_path)`
Contributor

Is this a breaking change? I'm not very familiar with named params in Lua.

Member Author

This was a bug in the documentation. The code did not change.

Comment on lines +10 to +21
local aws = require("aws")
local formats = require("formats")
local delta_exporter = require("lakefs/catalogexport/delta_exporter")

local table_descriptors_path = "_lakefs_tables"
local sc = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.region)
local delta_table_locations = delta_exporter.export_delta_log(action, args.table_names, sc.put_object, delta_client, table_descriptors_path)

for t, loc in pairs(delta_table_locations) do
    print("Delta Lake exported table \"" .. t .. "\"'s location: " .. loc .. "\n")
end
Contributor

Why are we copying the script here instead of using the higher-level Lua function?

Member Author

Not sure I understand the comment - can you explain?

Comment on lines 14 to 21
local table_descriptors_path = "_lakefs_tables"
local sc = azure.client(args.azure.storage_account, args.azure.access_key)
local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key)
local delta_table_locations = delta_exporter.export_delta_log(action, args.table_names, sc.put_object, delta_client, table_descriptors_path)

for t, loc in pairs(delta_table_locations) do
    print("Delta Lake exported table \"" .. t .. "\"'s location: " .. loc .. "\n")
end
Contributor

Same question applies here

end
args:
  azure:
    storage_account: "{{ .AzureStorageAccount }}"
Contributor

Is this a secret? If it isn't, consider just putting it here as plaintext.

Member Author

It makes it easier to change the storage account name in a single place in the code rather than in the YAML file.
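
The `{{ .AzureStorageAccount }}` placeholder reads like Go `text/template` syntax. As a minimal sketch of the single-place substitution described here, and assuming the test harness renders the action file with `text/template` (the `actionParams` struct, `renderAction` helper, and the account name are hypothetical, not taken from the PR):

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// actionParams is a hypothetical struct holding the values injected into the
// action file. Only the field name matches the placeholder quoted above.
type actionParams struct {
	AzureStorageAccount string
}

// actionTemplate is a trimmed, illustrative fragment of the hook arguments.
const actionTemplate = `args:
  azure:
    storage_account: "{{ .AzureStorageAccount }}"
`

// renderAction fills the template with the given parameters and returns the
// rendered YAML text.
func renderAction(p actionParams) (string, error) {
	tmpl, err := template.New("action").Parse(actionTemplate)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// The account name is defined once in code and flows into every rendered file.
	out, err := renderAction(actionParams{AzureStorageAccount: "examplestorageaccount"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

With this shape, renaming the storage account means changing one Go value rather than editing each YAML file.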

})

runs := waitForListRepositoryRunsLen(ctx, t, repo, headCommit.Id, 1)
require.Equal(t, "completed", runs.Results[0].Status)
Contributor

Consider checking the run result name or another identifier to make sure you're looking at the correct result.

Member Author

Done
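
One way to read the suggestion in this thread is to look a run up by an identifier before asserting on its status, instead of indexing `Results[0]`. The following is a minimal sketch under that assumption. The `actionRun` type, its fields, and `findRunByCommit` are hypothetical stand-ins, not the API types used in the PR:

```go
package main

import (
	"errors"
	"fmt"
)

// actionRun is a hypothetical, simplified stand-in for one entry of runs.Results.
type actionRun struct {
	RunID    string
	CommitID string
	Status   string
}

var errRunNotFound = errors.New("no run found for commit")

// findRunByCommit returns the first run recorded for the given commit ID, so
// the caller asserts on a run it has positively identified.
func findRunByCommit(runs []actionRun, commitID string) (actionRun, error) {
	for _, r := range runs {
		if r.CommitID == commitID {
			return r, nil
		}
	}
	return actionRun{}, fmt.Errorf("%w: %s", errRunNotFound, commitID)
}

func main() {
	runs := []actionRun{
		{RunID: "run-1", CommitID: "aaa111", Status: "failed"},
		{RunID: "run-2", CommitID: "bbb222", Status: "completed"},
	}
	run, err := findRunByCommit(runs, "bbb222")
	if err != nil {
		panic(err)
	}
	fmt.Println(run.RunID, run.Status)
}
```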

Contributor

@Isan-Rivkin Isan-Rivkin left a comment

Hey, that's a big PR, so it'll take some time, but I've added the most meaningful things I've found so far.


// DeleteRecursive Should be implemented when needed. There are nuances between HNS and BlobStorage which requires understanding the
// Actual use case before implementing the solution
func (c *Client) DeleteRecursive(l *lua.State) int {
Contributor

Let's remove this function. Why keep it if it's not implemented? I find it very confusing, including in the docs.

Member Author

For conformity, we created a client interface. The client objects need to implement that interface to support all the required functionality.
This is relevant beyond the scope of the delta exporter. I think it is essential to keep the code modular.
In the future, when we decide to add support for Hive export in Azure, we will need to implement it.
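
The trade-off discussed in this thread can be sketched in plain Go. This is a simplified illustration, not the PR's code: the real methods take a `*lua.State`, whereas the hypothetical `StorageClient` interface and in-memory `azureClient` below use ordinary Go types. Stubs satisfy the shared interface while returning an error that names the unsupported operation, which also speaks to the earlier concern about users not understanding why a call failed.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotImplemented is returned by operations a backend does not support yet.
var ErrNotImplemented = errors.New("not implemented")

// StorageClient is a hypothetical, simplified version of a shared client interface.
type StorageClient interface {
	PutObject(path string, data []byte) error
	ListObjects(prefix string) ([]string, error)
	DeleteRecursive(prefix string) error
}

// azureClient is an illustrative in-memory stand-in for an Azure storage client.
type azureClient struct {
	objects map[string][]byte
}

// Compile-time check that azureClient satisfies the interface.
var _ StorageClient = (*azureClient)(nil)

func (c *azureClient) PutObject(path string, data []byte) error {
	c.objects[path] = data
	return nil
}

// ListObjects is a stub: it names the operation so the caller can see what is missing.
func (c *azureClient) ListObjects(prefix string) ([]string, error) {
	return nil, fmt.Errorf("azure ListObjects(%q): %w", prefix, ErrNotImplemented)
}

// DeleteRecursive is a stub with the same explicit error.
func (c *azureClient) DeleteRecursive(prefix string) error {
	return fmt.Errorf("azure DeleteRecursive(%q): %w", prefix, ErrNotImplemented)
}

func main() {
	var c StorageClient = &azureClient{objects: map[string][]byte{}}
	if err := c.PutObject("container/key", []byte("data")); err != nil {
		panic(err)
	}
	if err := c.DeleteRecursive("container/prefix"); errors.Is(err, ErrNotImplemented) {
		fmt.Println("unsupported operation:", err)
	}
}
```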

Contributor

There are AWS clients for S3 and Glue, so why did you rename s3.go to client.go while keeping glue.go?
Let's keep the files separate, with clear names such as s3.go.

client *service.Client
}

func newClient(ctx context.Context) lua.Function {
Contributor

Can you be explicit about this client? It's not a general Azure client but a storage client.
We're investing in our Lua SDKs for more than export hooks, and the Azure SDK should be consistent with AWS, where we have glue, s3, etc.
It's easy to change now. Later, when we want to add functionality outside of Azure storage, this will get confusing very fast.

DeleteRecursive(l *lua.State) int
}

func InitStorageClient(l *lua.State, client Client) {
Contributor

IMO we don't need this abstraction. I fear it's too early and makes things more complex.

To make it easier to use with an exporter like export_delta_log, you can use the function param write_object and pass a wrapper function for the Azure storage client's write_object: function(bucket, key, data). It doesn't need to be at the SDK level.

  1. GCS is not included here
  2. In S3/Azure, two of the interface's functions are missing (delete_recursive, list_objects)
  3. PutObject(l *lua.State) int does not match in both cases (referring to the Azure implementation you added, with the comment // Skipping first argument as it is the host arg (bucket which is irrelevant in Azure)).
  4. Those SDKs are not meant to be used by export hooks only. Our users should be able to use them independently for their own needs, and combining them here doesn't seem friendly to the user experience.
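
The wrapper-function alternative proposed in this comment is about Lua, but the same idea can be sketched in Go: the exporter depends only on a function with the (bucket, key, data) shape, and an adapter closes over the storage client. All names below (`WriteObject`, `azureStorage`, `azureWriter`, `exportLog`) are hypothetical, and dropping the bucket argument is an assumption based on the quoted "bucket which is irrelevant in Azure" comment.

```go
package main

import "fmt"

// WriteObject mirrors the write_object(bucket, key, data) shape mentioned above.
type WriteObject func(bucket, key, data string) error

// azureStorage is a hypothetical in-memory stand-in whose put takes a single path.
type azureStorage struct {
	blobs map[string]string
}

func (a *azureStorage) put(path, data string) error {
	a.blobs[path] = data
	return nil
}

// azureWriter adapts azureStorage.put to the WriteObject shape. The bucket
// argument is ignored here, which is an assumption made for this sketch.
func azureWriter(a *azureStorage) WriteObject {
	return func(_ string, key, data string) error {
		return a.put(key, data)
	}
}

// exportLog is a hypothetical exporter that only depends on a WriteObject
// function, not on any storage client interface.
func exportLog(write WriteObject, bucket string, entries map[string]string) error {
	for key, data := range entries {
		if err := write(bucket, key, data); err != nil {
			return fmt.Errorf("write %s: %w", key, err)
		}
	}
	return nil
}

func main() {
	store := &azureStorage{blobs: map[string]string{}}
	entries := map[string]string{"table/_delta_log/00000.json": "{}"}
	if err := exportLog(azureWriter(store), "unused", entries); err != nil {
		panic(err)
	}
	fmt.Println(len(store.blobs), "object(s) written")
}
```

The design choice is that the adaptation lives at the call site, so each SDK client keeps its own natural signature and no shared interface is required.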

@N-o-Z N-o-Z requested a review from Isan-Rivkin February 12, 2024 13:21
Contributor

@Isan-Rivkin Isan-Rivkin left a comment

Thank you for addressing the previous comments!
I added some more comments. Overall, this looks like a very useful addition :)

pkg/actions/lua/lakefs/catalogexport/table_extractor.lua (outdated, resolved)
docs/howto/hooks/lua.md (outdated, resolved)
pkg/actions/lua/storage/azure/azure.go (resolved)
@N-o-Z N-o-Z requested a review from Isan-Rivkin February 13, 2024 14:33
Contributor

@Isan-Rivkin Isan-Rivkin left a comment

LGTM! Can't wait to use it!

@N-o-Z N-o-Z enabled auto-merge (squash) February 13, 2024 14:40
@N-o-Z N-o-Z merged commit ab22684 into master Feb 13, 2024
36 checks passed
@N-o-Z N-o-Z deleted the task/delta-exporter-for-azure-7296 branch February 13, 2024 14:53
Labels
include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Catalog Export Hooks for Unity to support Azure
3 participants