Delta Lake Catalog Exporter (#7078)
* delta client impl

* delta exporter

* Use the earliest available version to build the Delta log

* pass listen address to hooks

* add stat_object api call to the lakefs lua client

* add get_repo to eventually fetch its storage namespace

* delta lake exporter example

* add error return. remove redundant lines. add godocs for some functions

* extract get_storage_namespace to the internal lua file.
change response of delta_exporter, remove prints, remove old impl

* remove comment

* go mod tidy

* pass aws properties directly to fetchS3Table

* json encoding refactor

* fix load test

* fix tests

* fix tests

* goimports

* add indentation to lua test

* remove storage_utils.lua

* Rename `ListeningAddress` (and `listeningAddress`) to `ServerAddress`.
Lua refactoring
Remove unneeded `listeningAddress` member from `DeltaClient`

* fix test

* lint

* pass a delta client to the export_delta function
lua-require packages

* upgrade aws-sdk-v2

* go mod tidy

* use get_storage_prefix from the internal package. Remove it from docs

* remove get_storage_namespace and the `get_repo` method from lakeFS client since the storage namespace can be fetched from the global action object

* use get_storage_prefix from the internal package

* use correct error when returning

* comment the delta_log_entry_key_generator function and return a complete key name (including ".json" suffix)

* use a writer_client function instead of a storage client when passed to the export_delta_log function

* updated version of delta go

* serverAddress -> lakeFSAddr + comment

* change delta client builder API and drop storage type (since always using the S3 gateway). Use regex to validate lakeFS address

* rename the lakeFS address field and add clarifying comment
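
The commit message mentions that `delta_log_entry_key_generator` returns a complete key name including the ".json" suffix. As a rough illustration of what such a key looks like, here is a minimal Go sketch based on the Delta Lake convention of naming log entries with a 20-digit zero-padded version under `_delta_log/`. The helper name `deltaLogEntryKey` and its signature are hypothetical and are not taken from this commit, whose actual generator is written in Lua.

```go
package main

import (
	"fmt"
	"path"
)

// deltaLogEntryKey is a hypothetical helper (not part of this commit).
// It joins the table path, the "_delta_log" directory and a file name made of
// the version zero-padded to 20 digits followed by the ".json" suffix.
func deltaLogEntryKey(tablePath string, version int64) string {
	return path.Join(tablePath, "_delta_log", fmt.Sprintf("%020d.json", version))
}

func main() {
	// Print the key for version 3 of an example table path.
	fmt.Println(deltaLogEntryKey("path/to/table1", 3))
}
```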
Jonathan-Rosenberg authored Dec 11, 2023
1 parent 6c1448c commit bb38ba0
Showing 26 changed files with 714 additions and 237 deletions.
1 change: 1 addition & 0 deletions cmd/lakefs/cmd/run.go
@@ -203,6 +203,7 @@ var runCmd = &cobra.Command{
idGen,
bufferedCollector,
actions.Config(cfg.Actions),
cfg.ListenAddress,
)

// wire actions into entry catalog
7 changes: 0 additions & 7 deletions docs/howto/hooks/lua.md
@@ -403,13 +403,6 @@ local s3 = aws.s3_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_
exporter.export_s3(s3, args.table_descriptor_path, action, {debug=true})
```

### `lakefs/catalogexport/symlink_exporter.get_storage_uri_prefix(storage_ns, commit_id, action_info)`

Generate prefix for Symlink file(s) structure that represents a `ref` and a `commit` in lakeFS.
The output pattern `${storage_ns}_lakefs/exported/${ref}/${commit_id}/`.
The `ref` is deduced from the action event in `action_info` (i.e branch name).
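
The output pattern described in this documentation entry can be sketched in Go as follows. This is only an illustration of the documented pattern `${storage_ns}_lakefs/exported/${ref}/${commit_id}/`, not the Lua implementation from the repository. The function name `storageURIPrefix` is hypothetical, and normalizing the storage namespace to end with a single "/" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// storageURIPrefix is a hypothetical helper that mirrors the documented pattern
// "${storage_ns}_lakefs/exported/${ref}/${commit_id}/".
// Assumption: the storage namespace is normalized to end with exactly one "/".
func storageURIPrefix(storageNS, ref, commitID string) string {
	ns := strings.TrimRight(storageNS, "/") + "/"
	return fmt.Sprintf("%s_lakefs/exported/%s/%s/", ns, ref, commitID)
}

func main() {
	// Example values only: a branch named "main" and a made-up commit ID.
	fmt.Println(storageURIPrefix("s3://bucket/repo", "main", "abc123"))
}
```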


### `lakefs/catalogexport/glue_exporter`

A Package for automating the export process from lakeFS stored tables into Glue catalog.
25 changes: 25 additions & 0 deletions examples/hooks/delta_lake_S3_export.lua
@@ -0,0 +1,25 @@
--[[
action_args:
- repo
- commit_id
export_delta_args:
- table_paths: ["path/to/table1", "path/to/table2", ...]
- lakefs_key
- lakefs_secret
- region
storage_client:
- put_object: function(bucket, key, data)
]]
local aws = require("aws")
local formats = require("formats")
local delta_export = require("lakefs/catalogexport/delta_exporter")

local sc = aws.s3_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region)

local delta_client = formats.delta_client(args.lakefs.access_key_id, args.lakefs.secret_access_key, args.aws.aws_region)
local delta_table_locations = delta_export.export_delta_log(action, args.table_paths, sc.put_object, delta_client)
for t, loc in pairs(delta_table_locations) do
print("Delta Lake exported table \"" .. t .. "\"'s location: " .. loc .. "\n")
end
118 changes: 70 additions & 48 deletions go.mod
@@ -3,23 +3,23 @@ module github.com/treeverse/lakefs
go 1.21

require (
cloud.google.com/go v0.110.8 // indirect
cloud.google.com/go/storage v1.33.0
cloud.google.com/go v0.111.0 // indirect
cloud.google.com/go/storage v1.35.1
github.com/apache/thrift v0.19.0
github.com/cockroachdb/pebble v0.0.0-20230106151110-65ff304d3d7a
github.com/cubewise-code/go-mime v0.0.0-20200519001935-8c5762b177d8
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc
github.com/deepmap/oapi-codegen v1.5.6
github.com/dgraph-io/ristretto v0.1.1
github.com/fsnotify/fsnotify v1.6.0
github.com/fsnotify/fsnotify v1.7.0
github.com/getkin/kin-openapi v0.53.0
github.com/go-chi/chi/v5 v5.0.10
github.com/go-openapi/swag v0.19.14
github.com/go-test/deep v1.1.0
github.com/gobwas/glob v0.2.3
github.com/golang/mock v1.6.0
github.com/golang/protobuf v1.5.3 // indirect
github.com/google/uuid v1.3.1
github.com/google/uuid v1.4.0
github.com/hashicorp/go-multierror v1.1.1
github.com/hnlq715/golang-lru v0.3.0
github.com/jamiealquiza/tachymeter v2.0.0+incompatible
@@ -45,42 +45,43 @@ require (
github.com/vbauerster/mpb/v5 v5.4.0
github.com/xitongsys/parquet-go v1.6.2
github.com/xitongsys/parquet-go-source v0.0.0-20230607234618-40034c8066df
golang.org/x/crypto v0.14.0
golang.org/x/oauth2 v0.13.0
golang.org/x/term v0.13.0
google.golang.org/api v0.147.0
golang.org/x/crypto v0.16.0
golang.org/x/oauth2 v0.15.0
golang.org/x/term v0.15.0
google.golang.org/api v0.152.0
google.golang.org/protobuf v1.31.0
gopkg.in/natefinch/lumberjack.v2 v2.0.0
gopkg.in/yaml.v3 v3.0.1
)

require (
cloud.google.com/go/compute v1.23.0 // indirect
golang.org/x/sync v0.4.0
cloud.google.com/go/compute v1.23.3 // indirect
golang.org/x/sync v0.5.0
)

require (
cloud.google.com/go/compute/metadata v0.2.3
github.com/Azure/azure-sdk-for-go/sdk/azcore v1.8.0
github.com/Azure/azure-sdk-for-go/sdk/azcore v1.9.0
github.com/Azure/azure-sdk-for-go/sdk/azidentity v1.4.0
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos v0.3.6
github.com/Azure/azure-sdk-for-go/sdk/storage/azblob v1.2.0
github.com/IBM/pgxpoolprometheus v1.1.1
github.com/Shopify/go-lua v0.0.0-20221004153744-91867de107cf
github.com/alitto/pond v1.8.3
github.com/antonmedv/expr v1.15.3
github.com/aws/aws-sdk-go-v2 v1.23.4
github.com/aws/aws-sdk-go-v2/config v1.25.10
github.com/aws/aws-sdk-go-v2/credentials v1.16.8
github.com/aws/aws-sdk-go-v2 v1.23.5
github.com/aws/aws-sdk-go-v2/config v1.25.11
github.com/aws/aws-sdk-go-v2/credentials v1.16.9
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.12.7
github.com/aws/aws-sdk-go-v2/feature/dynamodb/expression v1.6.7
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.15.3
github.com/aws/aws-sdk-go-v2/feature/s3/manager v1.15.4
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.26.1
github.com/aws/aws-sdk-go-v2/service/glue v1.71.1
github.com/aws/aws-sdk-go-v2/service/s3 v1.47.1
github.com/aws/aws-sdk-go-v2/service/sts v1.26.1
github.com/aws/aws-sdk-go-v2/service/s3 v1.47.2
github.com/aws/aws-sdk-go-v2/service/sts v1.26.2
github.com/aws/smithy-go v1.18.1
github.com/benburkert/dns v0.0.0-20190225204957-d356cf78cdfc
github.com/csimplestring/delta-go v0.0.0-20231105162402-9b93ca02cedf
github.com/dgraph-io/badger/v4 v4.2.0
github.com/georgysavva/scany/v2 v2.0.0
github.com/go-co-op/gocron v1.35.2
@@ -98,54 +99,74 @@ require (
)

require (
cloud.google.com/go/iam v1.1.2 // indirect
cloud.google.com/go/iam v1.1.5 // indirect
github.com/Azure/azure-sdk-for-go v68.0.0+incompatible // indirect
github.com/Azure/azure-sdk-for-go/sdk/internal v1.3.0 // indirect
github.com/AzureAD/microsoft-authentication-library-for-go v1.1.1 // indirect
github.com/Azure/azure-sdk-for-go/sdk/internal v1.5.0 // indirect
github.com/Azure/go-autorest v14.2.0+incompatible // indirect
github.com/Azure/go-autorest/autorest/to v0.4.0 // indirect
github.com/AzureAD/microsoft-authentication-library-for-go v1.2.0 // indirect
github.com/ahmetb/go-linq/v3 v3.2.0 // indirect
github.com/apache/arrow/go/arrow v0.0.0-20200730104253-651201b0f516 // indirect
github.com/aws/aws-sdk-go v1.48.11 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.5.3 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.14.8 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.2.7 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.5.7 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.14.9 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.2.8 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.5.8 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.7.1 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.2.7 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.2.8 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.18.1 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.10.3 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.2.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.2.8 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.8.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.10.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.16.7 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.18.1 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.21.1 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.10.8 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.16.8 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.18.2 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.21.2 // indirect
github.com/barweiss/go-tuple v1.1.2 // indirect
github.com/benbjohnson/clock v1.3.0 // indirect
github.com/deckarep/golang-set/v2 v2.5.0 // indirect
github.com/fraugster/parquet-go v0.12.0 // indirect
github.com/getsentry/sentry-go v0.16.0 // indirect
github.com/go-sql-driver/mysql v1.7.0 // indirect
github.com/golang-jwt/jwt/v5 v5.0.0 // indirect
github.com/golang/glog v1.1.0 // indirect
github.com/go-logr/logr v1.3.0 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/golang-jwt/jwt/v5 v5.2.0 // indirect
github.com/golang/glog v1.1.2 // indirect
github.com/google/flatbuffers v2.0.0+incompatible // indirect
github.com/google/s2a-go v0.1.7 // indirect
github.com/googleapis/enterprise-certificate-proxy v0.3.1 // indirect
github.com/google/wire v0.5.0 // indirect
github.com/googleapis/enterprise-certificate-proxy v0.3.2 // indirect
github.com/hashicorp/go-cleanhttp v0.5.2 // indirect
github.com/hashicorp/yamux v0.0.0-20180604194846-3520598351bb // indirect
github.com/jackc/puddle/v2 v2.2.1 // indirect
github.com/klauspost/cpuid/v2 v2.2.5 // indirect
github.com/kylelemons/godebug v1.1.0 // indirect
github.com/lib/pq v1.10.9 // indirect
github.com/mitchellh/go-testing-interface v0.0.0-20171004221916-a61a99592b77 // indirect
github.com/mitchellh/hashstructure/v2 v2.0.2 // indirect
github.com/oklog/run v1.0.0 // indirect
github.com/pelletier/go-toml/v2 v2.1.0 // indirect
github.com/pierrec/lz4/v4 v4.1.8 // indirect
github.com/pkg/browser v0.0.0-20210911075715-681adbf594b8 // indirect
github.com/repeale/fp-go v0.11.1 // indirect
github.com/robfig/cron/v3 v3.0.1 // indirect
github.com/rogpeppe/go-internal v1.10.0 // indirect
github.com/rotisserie/eris v0.5.4 // indirect
github.com/rs/dnscache v0.0.0-20211102005908-e0241e321417 // indirect
github.com/sagikazarmark/locafero v0.3.0 // indirect
github.com/sagikazarmark/slog-shim v0.1.0 // indirect
github.com/samber/mo v1.11.0 // indirect
github.com/shopspring/decimal v1.3.1 // indirect
github.com/sourcegraph/conc v0.3.0 // indirect
go.uber.org/multierr v1.9.0 // indirect
github.com/ulule/deepcopier v0.0.0-20200430083143-45decc6639b6 // indirect
github.com/xhit/go-str2duration/v2 v2.1.0 // indirect
go.opentelemetry.io/otel v1.21.0 // indirect
go.opentelemetry.io/otel/metric v1.21.0 // indirect
go.opentelemetry.io/otel/trace v1.21.0 // indirect
go.uber.org/multierr v1.11.0 // indirect
gocloud.dev v0.34.1-0.20231122211418-53ccd8db26a1 // indirect
golang.org/x/time v0.5.0 // indirect
gonum.org/v1/gonum v0.9.3 // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20231002182017-d307bd883b97 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20231009173412-8bfb1ae86b6c // indirect
google.golang.org/genproto/googleapis/api v0.0.0-20231127180814-3a041ad873d4 // indirect
google.golang.org/genproto/googleapis/rpc v0.0.0-20231127180814-3a041ad873d4 // indirect
)

require (
@@ -175,7 +196,6 @@ require (
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
github.com/golang/snappy v0.0.4 // indirect
github.com/google/go-cmp v0.5.9 // indirect
github.com/google/shlex v0.0.0-20191202100458-e7afc7fbc510 // indirect
github.com/googleapis/gax-go/v2 v2.12.0 // indirect
github.com/hashicorp/errwrap v1.1.0 // indirect
@@ -218,16 +238,18 @@ require (
github.com/xeipuuv/gojsonschema v1.2.0 // indirect
go.opencensus.io v0.24.0 // indirect
go.uber.org/atomic v1.11.0
golang.org/x/exp v0.0.0-20230905200255-921286631fa9
golang.org/x/mod v0.12.0 // indirect
golang.org/x/net v0.17.0
golang.org/x/sys v0.13.0 // indirect
golang.org/x/text v0.13.0 // indirect
golang.org/x/tools v0.13.0 // indirect
golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 // indirect
google.golang.org/appengine v1.6.7 // indirect
google.golang.org/genproto v0.0.0-20231002182017-d307bd883b97 // indirect
google.golang.org/grpc v1.58.3
golang.org/x/exp v0.0.0-20231127185646-65229373498e
golang.org/x/mod v0.14.0 // indirect
golang.org/x/net v0.19.0
golang.org/x/sys v0.15.0 // indirect
golang.org/x/text v0.14.0 // indirect
golang.org/x/tools v0.16.0 // indirect
golang.org/x/xerrors v0.0.0-20231012003039-104605ab7028 // indirect
google.golang.org/appengine v1.6.8 // indirect
google.golang.org/genproto v0.0.0-20231127180814-3a041ad873d4 // indirect
google.golang.org/grpc v1.59.0
gopkg.in/ini.v1 v1.67.0 // indirect
gopkg.in/yaml.v2 v2.4.0 // indirect
)

replace github.com/csimplestring/delta-go => github.com/treeverse/delta-go v0.0.0-20231203131847-a5acc36c8ba5