
Docs: Glue Exporter and related #6823

Merged: 48 commits, merged on Oct 26, 2023
- c1cc898 glue client lua docs (Isan-Rivkin, Oct 19, 2023)
- 15b7545 symlink exporter docs (Isan-Rivkin, Oct 19, 2023)
- ffb4f3d added export hooks full docs (Isan-Rivkin, Oct 19, 2023)
- 82be2e9 fixes (Isan-Rivkin, Oct 19, 2023)
- 02875d3 fixes (Isan-Rivkin, Oct 19, 2023)
- 94f8225 fixes (Isan-Rivkin, Oct 19, 2023)
- 2b401b9 tab fixes (Isan-Rivkin, Oct 19, 2023)
- d0098c8 indent (Isan-Rivkin, Oct 19, 2023)
- f5bef53 update glue exporter (Isan-Rivkin, Oct 19, 2023)
- c91bd6b initial tutorial (Isan-Rivkin, Oct 22, 2023)
- cad6acc Added docs (Isan-Rivkin, Oct 22, 2023)
- 8351cae fix link (Isan-Rivkin, Oct 22, 2023)
- 3d5cb91 update docs (Isan-Rivkin, Oct 22, 2023)
- 3776994 fix broken anchors (Isan-Rivkin, Oct 22, 2023)
- 4d6b01b fix redirects (Isan-Rivkin, Oct 22, 2023)
- 0e0bc71 fix docs (Isan-Rivkin, Oct 22, 2023)
- d09808d WIP (Isan-Rivkin, Oct 22, 2023)
- ad41921 pending (Isan-Rivkin, Oct 23, 2023)
- 3a8638b added table (Isan-Rivkin, Oct 23, 2023)
- 2d65f56 indent (Isan-Rivkin, Oct 23, 2023)
- bc062be fix link (Isan-Rivkin, Oct 23, 2023)
- 296f31f Add imag (Isan-Rivkin, Oct 23, 2023)
- 34b6699 docs (Isan-Rivkin, Oct 23, 2023)
- 1399999 update location (Isan-Rivkin, Oct 23, 2023)
- 01aed58 fix infinite loop redirect (Isan-Rivkin, Oct 23, 2023)
- 6729b4e final examples of scripts (Isan-Rivkin, Oct 23, 2023)
- b666433 final docs (Isan-Rivkin, Oct 23, 2023)
- f3f4502 final (Isan-Rivkin, Oct 23, 2023)
- deab3f2 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 24, 2023)
- 0c896f4 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 24, 2023)
- 232b781 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 24, 2023)
- ac70832 fix partial comments (Isan-Rivkin, Oct 24, 2023)
- d6f0e31 Update docs/integrations/glue_metastore.md (Isan-Rivkin, Oct 25, 2023)
- a38480a Update docs/integrations/glue_metastore.md (Isan-Rivkin, Oct 25, 2023)
- 3a5788c Update docs/integrations/glue_metastore.md (Isan-Rivkin, Oct 25, 2023)
- 0ae1d76 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 25, 2023)
- d000484 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 25, 2023)
- f5f4406 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 25, 2023)
- 2bb6fc6 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 25, 2023)
- ac45af7 Update docs/howto/catalog_exports.md (Isan-Rivkin, Oct 25, 2023)
- 51b14a0 comments review fixes (Isan-Rivkin, Oct 25, 2023)
- 6559100 update fixes (Isan-Rivkin, Oct 25, 2023)
- 6719ba5 Update docs/integrations/glue_metastore.md (Isan-Rivkin, Oct 25, 2023)
- f8c337d Update docs/integrations/glue_metastore.md (Isan-Rivkin, Oct 26, 2023)
- 6d4a022 Update docs/integrations/glue_metastore.md (Isan-Rivkin, Oct 26, 2023)
- 9200b8a fixes (Isan-Rivkin, Oct 26, 2023)
- 0265cb9 fix (Isan-Rivkin, Oct 26, 2023)
- a0c0cc9 remove example (Isan-Rivkin, Oct 26, 2023)
Binary file added docs/assets/img/csv_export_hooks_data.png
Binary file added docs/assets/img/glue_export_hook_result_log.png
142 changes: 142 additions & 0 deletions docs/howto/catalog_exports.md
@@ -0,0 +1,142 @@
---
title: Data Catalogs Export
description: This section explains how lakeFS can integrate with external Data Catalogs via metastore update operations.
parent: How-To
---

# Data Catalogs Export

{% include toc_2-3.html %}

## About Data Catalogs Export

Data Catalog Export integrates query engines (like Spark, AWS Athena, Presto, etc.) with lakeFS.

Data Catalogs (such as Hive Metastore or AWS Glue) store metadata for services (such as Spark, Trino and Athena). They contain metadata such as the location of the table, information about columns, partitions and much more.

With Data Catalog Exports, one can leverage the versioning capabilities of lakeFS in external data warehouses and query engines to access tables with branches and commits.

At the end of this guide, you will be able to query lakeFS data from Athena, Trino and other catalog-dependent tools:

```sql
USE main;
USE my_branch; -- any branch
USE v101; -- or tag

SELECT * FROM users
INNER JOIN events
ON users.id = events.user_id; -- SQL stays the same; the branch or tag exists as a schema
```

## How it works

Several well-known formats exist today that let you export existing tables in lakeFS into a "native" object store representation
which does not require copying the data outside of lakeFS.

These are metadata representations and can be applied automatically through hooks.

### Table Declaration

After creating a lakeFS repository, configure tables as table descriptor objects in the repository under the path `_lakefs_tables/TABLE.yaml`.
Note: the Glue exporter can currently only export tables of `type: hive`. We expect to add more.

#### Hive tables

Hive Metastore tables are essentially just a set of objects that share a prefix, with no table metadata stored on the object store. You need to configure the prefix, partitions, and schema.
> **Reviewer (Contributor):** IIRC we already define "hive tables" in our docs elsewhere. Can we unify these definitions? (Fine to open an issue to do this in a future PR)
>
> **Isan-Rivkin (Contributor, Author), Oct 25, 2023:** Issue: #6863
> You're right, but the current definition is so slim (one sentence and an example) that I prefer not to unify it currently, since it's really a per-case definition.
> I'll open an issue because I prefer to have a more mature and organized definition when doing that, and not ad-hoc in this PR.

```yaml
name: animals
type: hive
path: path/to/animals/
partition_columns: ['year']
schema:
  type: struct
  fields:
    - name: year
      type: integer
      nullable: false
      metadata: {}
    - name: page
      type: string
      nullable: false
      metadata: {}
    - name: site
      type: string
      nullable: true
      metadata:
        comment: a comment about this column
```

Useful types recognized by Hive include `integer`, `long`, `short`, `string`, `double`, `float`, `date`, and `timestamp`.
{: .note }

### Catalog Exporters

Exporters are code packages accessible through [Lua integration]({% link howto/hooks/lua.md %}#lua-library-reference). Each exporter is exposed as a Lua function under the package namespace `lakefs/catalogexport`. Call them from hooks to connect lakeFS tables to various catalogs.

#### Currently supported exporters

- Symlink Exporter: Writes metadata for the table using Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html)
- AWS Glue Catalog (+ Athena) Exporter: Creates a table in Glue using Hive's format and updates the location to symlink files (reuses Symlink Exporter).
- See a step-by-step guide on how to integrate with the [Glue Exporter]({% link integrations/glue_metastore.md %})

#### Running an Exporter

Exporters are meant to run as [Lua hooks]({% link howto/hooks/lua.md %}).

Configure the action trigger using [events and branches]({% link howto/hooks/index.md %}#action-file-schema). Of course, you can add additional custom filtering logic to the Lua script if needed.
The default name of an exported table is `${repository_id}_${name}_${ref_name}_${short_commit}`, where `name` is the `name` field from the table's `_lakefs_tables/TABLE.yaml` descriptor.
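For illustration, this naming convention can be sketched as a tiny helper (the function name and the short-commit length are assumptions for the sketch; the exporter computes this internally):

```python
def default_export_table_name(repository_id: str, table_name: str,
                              ref_name: str, commit_id: str) -> str:
    """Build the default exported table name from its components.

    The commit ID is truncated to a short form, assumed here to be
    the first 6 characters (illustrative; the actual length may differ).
    """
    short_commit = commit_id[:6]
    return f"{repository_id}_{table_name}_{ref_name}_{short_commit}"

# e.g. default_export_table_name("repo", "animals", "main", "abcdef123456")
# -> "repo_animals_main_abcdef"
```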

Example of an action that will be triggered when a `post-commit` event occurs on the `export_table` branch:

```yaml
name: Glue Table Exporter
description: export my table to glue
on:
  post-commit:
    branches: ["export_table"]
hooks:
  - id: my_exporter
    type: lua
    properties:
      # exporter script location
      script_path: "scripts/my_export_script.lua"
      args:
        # table descriptor
        table_source: '_lakefs_tables/my_table.yaml'
```

Tip: Actions can be extended to customize any desired behavior, for example validating branch names, since they are part of the table name:

```yaml
# _lakefs_actions/validate_branch_name.yaml
name: validate-lower-case-branches
on:
  pre-create-branch:
hooks:
  - id: check_branch_id
    type: lua
    properties:
      script: |
        regexp = require("regexp")
        if not regexp.match("^[a-z0-9\\_\\-]+$", action.branch_id) then
          error("branches must be lower case, invalid branch ID: " .. action.branch_id)
        end
```
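The same rule can be reproduced outside of a hook, e.g. to check names before creating branches from a script. A rough Python equivalent of the Lua check above (the function name is illustrative):

```python
import re

# Lower-case alphanumerics, underscore, and hyphen only, same as the Lua rule.
BRANCH_ID_RE = re.compile(r"^[a-z0-9_\-]+$")

def validate_branch_id(branch_id: str) -> None:
    """Raise ValueError if the branch ID would fail the hook's check."""
    if not BRANCH_ID_RE.match(branch_id):
        raise ValueError(f"branches must be lower case, invalid branch ID: {branch_id}")
```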

### Flow

The following diagram demonstrates what happens when a lakeFS action trigger runs a Lua hook that calls an exporter.

```mermaid
sequenceDiagram
note over Lua Hook: lakeFS Action trigger. <br> Pass Context for the export.
Lua Hook->>Exporter: export request
note over Table Registry: _lakefs_tables/TABLE.yaml
Exporter->>Table Registry: Get table descriptor
Table Registry->>Exporter: Parse table structure
Exporter->>Object Store: materialize an exported table
Exporter->>Catalog: register object store location
Query Engine-->Catalog: Query
Query Engine-->Object Store: Query
```
193 changes: 193 additions & 0 deletions docs/howto/hooks/lua.md
@@ -142,6 +142,55 @@ or:

Deletes all objects under the given prefix

### `aws/glue`

Glue client library.

```lua
local aws = require("aws")
-- pass valid AWS credentials
local glue = aws.glue_client("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "REGION")
```

### `aws/glue.get_table(database, table [, catalog_id])`

Describe a table from the Glue catalog.

Example:

```lua
local json = require("encoding/json")
local table, exists = glue.get_table(db, table_name)
if exists then
  print(json.marshal(table))
end
```

### `aws/glue.create_table(database, table_input [, catalog_id])`

Create a new table in Glue Catalog.
The `table_input` argument is a JSON string that is passed "as is" to AWS and parallels the AWS SDK [TableInput](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html#API_CreateTable_RequestSyntax)

Example:

```lua
local json = require("encoding/json")
local input = {
  Name = "my-table",
  PartitionKeys = array(partitions),
  -- etc...
}
local json_input = json.marshal(input)
glue.create_table("my-db", json_input)
```

### `aws/glue.update_table(database, table_input [, catalog_id, version_id, skip_archive])`

Update an existing table in the Glue Catalog.
The `table_input` argument is the same as in the `glue.create_table` function.

### `aws/glue.delete_table(database, table_input [, catalog_id])`

Delete an existing table in the Glue Catalog.

### `crypto/aes/encryptCBC(key, plaintext)`

Returns a ciphertext for the AES-encrypted text
@@ -274,6 +323,150 @@

Returns an object-wise diff of uncommitted changes on `branch_id`.

### `lakefs/catalogexport/table_extractor`

Utility package to parse `_lakefs_tables/` descriptors.

### `lakefs/catalogexport/table_extractor.list_table_descriptor_entries(client, repo_id, commit_id)`

Lists all YAML files under `_lakefs_tables/*`, ignoring hidden files, and returns a list of entries of the form `[{physical_address, path}]`.
The `client` is the `lakefs` client.
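Conceptually, this listing is a filter over object entries. A rough Python sketch under assumed entry shapes (this is not the Lua API, just an illustration):

```python
def visible_table_descriptors(entries):
    """Filter listing results down to visible YAML table descriptors.

    `entries` is assumed to be a list of dicts with `path` and
    `physical_address` keys, as returned by an object listing under
    the `_lakefs_tables/` prefix.
    """
    out = []
    for e in entries:
        name = e["path"].rsplit("/", 1)[-1]
        if name.startswith("."):  # ignore hidden files
            continue
        if not (name.endswith(".yaml") or name.endswith(".yml")):
            continue  # only YAML descriptors are considered
        out.append({"physical_address": e["physical_address"], "path": e["path"]})
    return out
```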

### `lakefs/catalogexport/table_extractor.get_table_descriptor(client, repo_id, commit_id, logical_path)`

Reads a table descriptor and parses the YAML object. Sets `partition_columns` to `{}` if no partitions are defined.
The `client` is the `lakefs` client.

### `lakefs/catalogexport/hive.extract_partition_pager(client, repo_id, commit_id, base_path, partition_cols, page_size)`

Hive-format partition iterator; each result set is a collection of files under the same partition in lakeFS.

Example:

```lua
local lakefs = require("lakefs")
local hive = require("lakefs/catalogexport/hive")
local pager = hive.extract_partition_pager(lakefs, repo_id, commit_id, prefix, partitions, 10)
for part_key, entries in pager do
  print("partition: " .. part_key)
  for _, entry in ipairs(entries) do
    print("path: " .. entry.path .. " physical: " .. entry.physical_address)
  end
end
```
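To illustrate the grouping the pager performs, here is a simplified Python sketch that groups object paths by their Hive-style partition prefix (it leaves out the lakeFS listing and paging that the real iterator does; the path shapes are assumptions):

```python
from itertools import groupby

def group_by_partition(paths, base_path, num_partition_cols):
    """Group object paths by their Hive-style partition prefix.

    Assumes paths look like `base_path/p1=v1/.../file`, with one
    `k=v` path segment per partition column.
    """
    def partition_key(path):
        rel = path[len(base_path):].lstrip("/")
        # The first `num_partition_cols` segments form the partition key.
        return "/".join(rel.split("/")[:num_partition_cols])

    # groupby requires input sorted by the same key.
    for key, group in groupby(sorted(paths, key=partition_key), key=partition_key):
        yield key, list(group)
```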

### `lakefs/catalogexport/symlink_exporter`

Writes metadata for a table using Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html).
Currently only `S3` is supported.

The default export paths per commit:

```
${storageNamespace}
  _lakefs/
    exported/
      ${ref}/
        ${commitId}/
          ${tableName}/
            p1=v1/symlink.txt
            p1=v2/symlink.txt
            p1=v3/symlink.txt
            ...
```

### `lakefs/catalogexport/symlink_exporter.export_s3(s3_client, table_src_path, action_info [, options])`

Export Symlink files that represent a table to S3 location.

Parameters:

- `s3_client`: Configured client.
- `table_src_path(string)`: Path to the table spec YAML file in `_lakefs_tables` (e.g. `_lakefs_tables/my_table.yaml`).
- `action_info(table)`: The global action object.
- `options(table)`:
  - `debug(boolean)`: Print extra info.
  - `export_base_uri(string)`: Override the prefix in S3, e.g. `s3://other-bucket/path/`.
  - `writer(function(bucket, key, data))`: If provided, the S3 client will not be used; helpful for debugging.

Example:

```lua
local exporter = require("lakefs/catalogexport/symlink_exporter")
local aws = require("aws")
-- args are user inputs from a lakeFS action.
local s3 = aws.s3_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region)
exporter.export_s3(s3, args.table_descriptor_path, action, {debug=true})
```

### `lakefs/catalogexport/symlink_exporter.get_storage_uri_prefix(storage_ns, commit_id, action_info)`

Generates the prefix for the symlink file(s) structure that represents a `ref` and a `commit` in lakeFS.
The output follows the pattern `${storage_ns}_lakefs/exported/${ref}/${commit_id}/`.
The `ref` is deduced from the action event in `action_info` (e.g. the branch name).
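The pattern can be sketched as a one-line helper (illustrative only; it assumes `storage_ns` ends with a trailing slash, e.g. `s3://bucket/repo/`):

```python
def storage_uri_prefix(storage_ns: str, ref: str, commit_id: str) -> str:
    """Build the export prefix `${storage_ns}_lakefs/exported/${ref}/${commit_id}/`.

    Assumes `storage_ns` already carries its trailing slash.
    """
    return f"{storage_ns}_lakefs/exported/{ref}/{commit_id}/"
```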


### `lakefs/catalogexport/glue_exporter`

A package for automating the export process from lakeFS-stored tables into the Glue Catalog.

### `lakefs/catalogexport/glue_exporter.export_glue(glue, db, table_src_path, create_table_input, action_info, options)`

Represents a lakeFS table in the Glue Catalog.
This function creates a table in Glue based on the given configuration.
It assumes that a symlink location has already been created, and by default configures it for the same commit.

Parameters:

- `glue`: AWS Glue client.
- `db(string)`: Glue database name.
- `table_src_path(string)`: Path to the table spec (e.g. `_lakefs_tables/my_table.yaml`).
- `create_table_input(Table)`: Input mapping to [TableInput](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html#API_CreateTable_RequestSyntax) in AWS, the same as used for `glue.create_table`.
  It should contain inputs describing the data format (e.g. `InputFormat`, `OutputFormat`, `SerdeInfo`), since the exporter is agnostic to these.
  By default this function will configure the table location and schema.
- `action_info(Table)`: The global action object.
- `options(Table)`:
  - `table_name(string)`: Override the default Glue table name.
  - `debug(boolean)`: Print extra info.
  - `export_base_uri(string)`: Override the default prefix in S3 for the symlink location, e.g. `s3://other-bucket/path/`.

When creating a Glue table, the final table input will consist of the `create_table_input` parameter and lakeFS-computed defaults that override it:

- `Name`: Glue table name, from `get_full_table_name(descriptor, action_info)`.
- `PartitionKeys`: Partition columns, usually deduced from `_lakefs_tables/${table_src_path}`.
- `TableType` = `"EXTERNAL_TABLE"`
- `StorageDescriptor`: Columns, usually deduced from `_lakefs_tables/${table_src_path}`.
- `StorageDescriptor.Location` = symlink_location
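The override behavior described above amounts to a dictionary merge in which the computed defaults win. A hedged Python sketch (names and shapes are illustrative, not the exporter's internals):

```python
def merged_table_input(create_table_input: dict, table_name: str,
                       partition_keys: list, columns: list,
                       symlink_location: str) -> dict:
    """Overlay lakeFS-computed defaults on top of the user-supplied input."""
    merged = dict(create_table_input)               # user input first
    sd = dict(merged.get("StorageDescriptor", {}))  # copy the nested struct
    sd["Columns"] = columns                         # deduced from the descriptor
    sd["Location"] = symlink_location               # points at the symlink files
    merged.update({
        "Name": table_name,
        "PartitionKeys": partition_keys,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": sd,
    })
    return merged
```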

Example:

```lua
local aws = require("aws")
local exporter = require("lakefs/catalogexport/glue_exporter")
local glue = aws.glue_client(args.aws_access_key_id, args.aws_secret_access_key, args.aws_region)
-- table_input can be passed as a simple key-value object in YAML as an argument from an action; this is an inline example:
local table_input = {
  StorageDescriptor = {
    InputFormat = "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
    OutputFormat = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat",
    SerdeInfo = {
      SerializationLibrary = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
  },
  Parameters = {
    classification = "parquet",
    EXTERNAL = "TRUE",
    ["parquet.compression"] = "SNAPPY"
  }
}
exporter.export_glue(glue, "my-db", "_lakefs_tables/animals.yaml", table_input, action, {debug=true})
```

### `lakefs/catalogexport/glue_exporter.get_full_table_name(descriptor, action_info)`

Generate the Glue table name.

Parameters:

- `descriptor(Table)`: The parsed table descriptor object (e.g. from `_lakefs_tables/my_table.yaml`).
- `action_info(Table)`: The global action object.

### `path/parse(path_string)`

Returns a table for the given path string with the following structure:
1 change: 1 addition & 0 deletions docs/howto/index.md
@@ -21,6 +21,7 @@ has_toc: false

* [Import](/howto/import.html) and [Export Data](/howto/export.html) from lakeFS
* [Copy data](/howto/copying.html) to/from lakeFS
* [Using external Data Catalogs](/howto/catalog_exports.html) with data stored on lakeFS
* [Migrating](/howto/migrate-away.html) away from lakeFS

## Actions and Hooks in lakeFS
2 changes: 1 addition & 1 deletion docs/integrations/glue_hive_metastore.md
@@ -9,7 +9,7 @@ redirect_from: /using/glue_hive_metastore.html

{: .warning }
**Deprecated Feature:** Having heard the feedback from the community, we are planning to replace the below manual steps with an automated process.
You can read more about it [here](https://github.com/treeverse/lakeFS/issues/6461).
You can read more about it [here]({% link howto/catalog_exports.md %}).

{% include toc_2-3.html %}
