Docs: Glue Exporter and related (#6823)
Isan-Rivkin authored Oct 26, 2023
1 parent 2bd8155 commit 3c59c82
Showing 8 changed files with 597 additions and 1 deletion.
Binary file added docs/assets/img/csv_export_hooks_data.png
Binary file added docs/assets/img/glue_export_hook_result_log.png
142 changes: 142 additions & 0 deletions docs/howto/catalog_exports.md
@@ -0,0 +1,142 @@
---
title: Data Catalogs Export
description: This section explains how lakeFS can integrate with external Data Catalogs via metastore update operations.
parent: How-To
---

# Data Catalogs Export

{% include toc_2-3.html %}

## About Data Catalogs Export

Data Catalog Export integrates query engines (such as Spark, AWS Athena, and Presto) with lakeFS.

Data Catalogs (such as Hive Metastore or AWS Glue) store metadata for services (such as Spark, Trino and Athena). They contain metadata such as the location of the table, information about columns, partitions and much more.

With Data Catalog Exports, one can leverage the versioning capabilities of lakeFS in external data warehouses and query engines to access tables with branches and commits.

At the end of this guide, you will be able to query lakeFS data from Athena, Trino and other catalog-dependent tools:

```sql
USE main;
USE my_branch; -- any branch
USE v101; -- or tag

SELECT * FROM users
INNER JOIN events
ON users.id = events.user_id; -- SQL stays the same; the branch or tag is exposed as a schema
```

## How it works

Several well-known formats exist today that let you export existing tables in lakeFS into a "native" object store representation
which does not require copying the data outside of lakeFS.

These are metadata representations and can be applied automatically through hooks.

### Table Declaration

After creating a lakeFS repository, configure tables as table descriptor objects on the repository on the path `_lakefs_tables/TABLE.yaml`.
Note: the Glue exporter can currently only export tables of `type: hive`. We expect to add more.

#### Hive tables

Hive metadata server tables are essentially just a set of objects that share a prefix, with no table metadata stored on the object store. You need to configure prefix, partitions, and schema.

```yaml
name: animals
type: hive
path: path/to/animals/
partition_columns: ['year']
schema:
  type: struct
  fields:
    - name: year
      type: integer
      nullable: false
      metadata: {}
    - name: page
      type: string
      nullable: false
      metadata: {}
    - name: site
      type: string
      nullable: true
      metadata:
        comment: a comment about this column
```
Useful types recognized by Hive include `integer`, `long`, `short`, `string`, `double`, `float`, `date`, and `timestamp`.
{: .note }

### Catalog Exporters

Exporters are code packages accessible through [Lua integration]({% link howto/hooks/lua.md %}#lua-library-reference). Each exporter is exposed as a Lua function under the package namespace `lakefs/catalogexport`. Call them from hooks to connect lakeFS tables to various catalogs.
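
For instance, a hook script loads an exporter from that namespace with a plain `require` (a minimal sketch; which module you load depends on the target catalog):

```lua
-- load exporters from the lakefs/catalogexport namespace inside a Lua hook
local symlink_exporter = require("lakefs/catalogexport/symlink_exporter")
local glue_exporter = require("lakefs/catalogexport/glue_exporter")
```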

#### Currently supported exporters

- Symlink Exporter: Writes metadata for the table using Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html)
- AWS Glue Catalog (+ Athena) Exporter: Creates a table in Glue using Hive's format and updates the location to symlink files (reuses Symlink Exporter).
- See a step-by-step guide on how to integrate with the [Glue Exporter]({% link integrations/glue_metastore.md %})

#### Running an Exporter

Exporters are meant to run as [Lua hooks]({% link howto/hooks/lua.md %}).

Configure the action trigger using [events and branches]({% link howto/hooks/index.md %}#action-file-schema). You can add custom filtering logic to the Lua script if needed.
The default name of an exported table is `${repository_id}_${_lakefs_tables/TABLE.yaml(name field)}_${ref_name}_${short_commit}`.

Example of an action that will be triggered when a `post-commit` event happens in the `export_table` branch.

```yaml
name: Glue Table Exporter
description: export my table to glue
on:
  post-commit:
    branches: ["export_table"]
hooks:
  - id: my_exporter
    type: lua
    properties:
      # exporter script location
      script_path: "scripts/my_export_script.lua"
      args:
        # table descriptor
        table_source: '_lakefs_tables/my_table.yaml'
```
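
The `script_path` above points at a Lua script stored in the repository. A sketch of what `scripts/my_export_script.lua` could look like for the Glue exporter (the script name and the extra arguments such as AWS credentials, Glue database, and table input are assumptions here and would be supplied via the action's `args`):

```lua
local aws = require("aws")
local exporter = require("lakefs/catalogexport/glue_exporter")

-- `args` and `action` are provided by the hook runtime; the specific args below are hypothetical
local glue = aws.glue_client(args.aws_access_key_id, args.aws_secret_access_key, args.aws_region)
exporter.export_glue(glue, args.glue_db, args.table_source, args.table_input, action, {debug=true})
```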

Tip: Actions can be extended to customize any desired behavior, for example validating branch names, since they become part of the table name:

```yaml
# _lakefs_actions/validate_branch_name.yaml
name: validate-lower-case-branches
on:
  pre-create-branch:
hooks:
  - id: check_branch_id
    type: lua
    properties:
      script: |
        regexp = require("regexp")
        if not regexp.match("^[a-z0-9\\_\\-]+$", action.branch_id) then
          error("branches must be lower case, invalid branch ID: " .. action.branch_id)
        end
```

### Flow

The following diagram demonstrates what happens when a lakeFS Action triggers a Lua hook that calls an exporter.

```mermaid
sequenceDiagram
note over Lua Hook: lakeFS Action trigger. <br> Pass Context for the export.
Lua Hook->>Exporter: export request
note over Table Registry: _lakefs_tables/TABLE.yaml
Exporter->>Table Registry: Get table descriptor
Table Registry->>Exporter: Parse table structure
Exporter->>Object Store: materialize an exported table
Exporter->>Catalog: register object store location
Query Engine-->Catalog: Query
Query Engine-->Object Store: Query
```
193 changes: 193 additions & 0 deletions docs/howto/hooks/lua.md
@@ -142,6 +142,55 @@ or:

Deletes all objects under the given prefix

### `aws/glue`

Glue client library.

```lua
local aws = require("aws")
-- pass valid AWS credentials
local glue = aws.glue_client("ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "REGION")
```

### `aws/glue.get_table(database, table [, catalog_id])`

Describe a table from the Glue catalog.

Example:

```lua
local json = require("encoding/json")

local table, exists = glue.get_table(db, table_name)
if exists then
  print(json.marshal(table))
end
```

### `aws/glue.create_table(database, table_input [, catalog_id])`

Create a new table in Glue Catalog.
The `table_input` argument is a JSON document that is passed "as is" to AWS and mirrors the AWS SDK [TableInput](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html#API_CreateTable_RequestSyntax).

Example:

```lua
local json = require("encoding/json")
local input = {
  Name = "my-table",
  PartitionKeys = array(partitions),
  -- etc...
}
local json_input = json.marshal(input)
glue.create_table("my-db", json_input)
```

### `aws/glue.update_table(database, table_input [, catalog_id, version_id, skip_archive])`

Update an existing table in the Glue Catalog.
The `table_input` argument is the same as in the `glue.create_table` function.
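
Example (a minimal sketch, reusing the `input` table built in the `glue.create_table` example above):

```lua
local json = require("encoding/json")
-- marshal the same table input used at creation time and apply the update
local json_input = json.marshal(input)
glue.update_table("my-db", json_input)
```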

### `aws/glue.delete_table(database, table_input [, catalog_id])`

Delete an existing table in the Glue Catalog.

### `crypto/aes/encryptCBC(key, plaintext)`

Returns a ciphertext for the AES-encrypted text
@@ -274,6 +323,150 @@ Returns 2 values:

Returns an object-wise diff of uncommitted changes on `branch_id`.

### `lakefs/catalogexport/table_extractor`

Utility package to parse `_lakefs_tables/` descriptors.

### `lakefs/catalogexport/table_extractor.list_table_descriptor_entries(client, repo_id, commit_id)`

List all YAML files under `_lakefs_tables/*` and return a list of the form `[{physical_address, path}]`; hidden files are ignored.
The `client` is the `lakefs` client.

### `lakefs/catalogexport/table_extractor.get_table_descriptor(client, repo_id, commit_id, logical_path)`

Read a table descriptor and parse it as a YAML object. Sets `partition_columns` to `{}` if no partitions are defined.
The `client` is the `lakefs` client.
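
Example (a minimal sketch of using both functions from a hook; the repository ID, commit ID, and descriptor paths are placeholders):

```lua
local lakefs = require("lakefs")
local extractor = require("lakefs/catalogexport/table_extractor")

-- list all table descriptors under _lakefs_tables/ for a given commit
local entries = extractor.list_table_descriptor_entries(lakefs, "example-repo", "abc123")
for _, entry in ipairs(entries) do
  print("descriptor: " .. entry.path)
  -- parse a single descriptor into a Lua table (fields follow the _lakefs_tables spec, e.g. name, type, path)
  local descriptor = extractor.get_table_descriptor(lakefs, "example-repo", "abc123", entry.path)
  print("table name: " .. descriptor.name)
end
```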

### `lakefs/catalogexport/hive.extract_partition_pager(client, repo_id, commit_id, base_path, partition_cols, page_size)`

Hive-format partition iterator; each result set is a collection of files under the same partition in lakeFS.

Example:

```lua
local lakefs = require("lakefs")
local hive = require("lakefs/catalogexport/hive")

local pager = hive.extract_partition_pager(lakefs, repo_id, commit_id, prefix, partitions, 10)
for part_key, entries in pager do
  print("partition: " .. part_key)
  for _, entry in ipairs(entries) do
    print("path: " .. entry.path .. " physical: " .. entry.physical_address)
  end
end
```

### `lakefs/catalogexport/symlink_exporter`

Writes metadata for a table using Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html).
Currently only `S3` is supported.

The default export paths per commit:

```
${storageNamespace}
  _lakefs/
    exported/
      ${ref}/
        ${commitId}/
          ${tableName}/
            p1=v1/symlink.txt
            p1=v2/symlink.txt
            p1=v3/symlink.txt
            ...
```

### `lakefs/catalogexport/symlink_exporter.export_s3(s3_client, table_src_path, action_info [, options])`

Export symlink files that represent a table to an S3 location.

Parameters:

- `s3_client`: Configured client.
- `table_src_path(string)`: Path to the table spec YAML file in `_lakefs_tables` (e.g. `_lakefs_tables/my_table.yaml`).
- `action_info(table)`: The global action object.
- `options(table)`:
  - `debug(boolean)`: Print extra info.
  - `export_base_uri(string)`: Override the prefix in S3, e.g. `s3://other-bucket/path/`.
  - `writer(function(bucket, key, data))`: If provided, the S3 client is not used; helpful for debugging.

Example:

```lua
local exporter = require("lakefs/catalogexport/symlink_exporter")
local aws = require("aws")
-- args are user inputs from a lakeFS action.
local s3 = aws.s3_client(args.aws.aws_access_key_id, args.aws.aws_secret_access_key, args.aws.aws_region)
exporter.export_s3(s3, args.table_descriptor_path, action, {debug=true})
```

### `lakefs/catalogexport/symlink_exporter.get_storage_uri_prefix(storage_ns, commit_id, action_info)`

Generate the prefix for the symlink file structure that represents a `ref` and a `commit` in lakeFS.
The output pattern is `${storage_ns}_lakefs/exported/${ref}/${commit_id}/`.
The `ref` is deduced from the action event in `action_info` (e.g. the branch name).
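
Example (a minimal sketch, assuming `storage_ns` and `commit_id` are already available in the hook's context):

```lua
local exporter = require("lakefs/catalogexport/symlink_exporter")

-- `action` is the global action object provided by the hook runtime
local prefix = exporter.get_storage_uri_prefix(storage_ns, commit_id, action)
print("symlinks will be written under: " .. prefix)
```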


### `lakefs/catalogexport/glue_exporter`

A package for automating the export of tables stored in lakeFS into the Glue catalog.

### `lakefs/catalogexport/glue_exporter.export_glue(glue, db, table_src_path, create_table_input, action_info, options)`

Represent a lakeFS table in the Glue Catalog.
This function creates a table in Glue based on the given configuration.
It assumes that a symlink location has already been created, and by default it points the table at that location for the same commit.

Parameters:

- `glue`: AWS Glue client.
- `db(string)`: Glue database name.
- `table_src_path(string)`: Path to the table spec (e.g. `_lakefs_tables/my_table.yaml`).
- `create_table_input(Table)`: Input mapping to [table_input](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html#API_CreateTable_RequestSyntax) in AWS, the same as used for `glue.create_table`.
It should contain the inputs describing the data format (e.g. InputFormat, OutputFormat, SerdeInfo), since the exporter is agnostic to those; by default this function configures the table location and schema.
- `action_info(Table)`: The global action object.
- `options(Table)`:
  - `table_name(string)`: Override the default Glue table name.
  - `debug(boolean)`: Print extra info.
  - `export_base_uri(string)`: Override the default prefix in S3 for the symlink location, e.g. `s3://other-bucket/path/`.

When creating a Glue table, the final table input consists of the `create_table_input` parameter merged with defaults computed by lakeFS, which override it:

- `Name`: Glue table name generated by `get_full_table_name(descriptor, action_info)`.
- `PartitionKeys`: Partition columns, usually deduced from `_lakefs_tables/${table_src_path}`.
- `TableType`: `"EXTERNAL_TABLE"`.
- `StorageDescriptor`: Columns, usually deduced from `_lakefs_tables/${table_src_path}`.
- `StorageDescriptor.Location`: The symlink location.

Example:

```lua
local aws = require("aws")
local exporter = require("lakefs/catalogexport/glue_exporter")
local glue = aws.glue_client(args.aws_access_key_id, args.aws_secret_access_key, args.aws_region)
-- table_input can be passed as a simple key-value object in YAML as an argument from an action; here it is defined inline:
local table_input = {
  StorageDescriptor = {
    InputFormat = "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
    OutputFormat = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat",
    SerdeInfo = {
      SerializationLibrary = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    },
    Parameters = {
      classification = "parquet",
      EXTERNAL = "TRUE",
      ["parquet.compression"] = "SNAPPY"
    }
  }
}
exporter.export_glue(glue, "my-db", "_lakefs_tables/animals.yaml", table_input, action, {debug=true})
```

### `lakefs/catalogexport/glue_exporter.get_full_table_name(descriptor, action_info)`

Generate the Glue table name.

Parameters:

- `descriptor(Table)`: The parsed table descriptor object (e.g. from `_lakefs_tables/my_table.yaml`).
- `action_info(Table)`: The global action object.
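
Example (a minimal sketch, assuming the descriptor was read with `table_extractor.get_table_descriptor` and that `action.repository_id` and `action.commit_id` are available from the hook context):

```lua
local lakefs = require("lakefs")
local extractor = require("lakefs/catalogexport/table_extractor")
local glue_exporter = require("lakefs/catalogexport/glue_exporter")

local descriptor = extractor.get_table_descriptor(lakefs, action.repository_id, action.commit_id, "_lakefs_tables/animals.yaml")
-- follows the default pattern ${repository_id}_${name}_${ref_name}_${short_commit}
local name = glue_exporter.get_full_table_name(descriptor, action)
print("glue table name: " .. name)
```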

### `path/parse(path_string)`

Returns a table for the given path string with the following structure:
1 change: 1 addition & 0 deletions docs/howto/index.md
@@ -21,6 +21,7 @@ has_toc: false

* [Import](/howto/import.html) and [Export Data](/howto/export.html) from lakeFS
* [Copy data](/howto/copying.html) to/from lakeFS
* [Using external Data Catalogs](/howto/catalog_exports.html) with data stored on lakeFS
* [Migrating](/howto/migrate-away.html) away from lakeFS

## Actions and Hooks in lakeFS
2 changes: 1 addition & 1 deletion docs/integrations/glue_hive_metastore.md
@@ -9,7 +9,7 @@ redirect_from: /using/glue_hive_metastore.html

{: .warning }
**Deprecated Feature:** Having heard the feedback from the community, we are planning to replace the below manual steps with an automated process.
You can read more about it [here](https://github.com/treeverse/lakeFS/issues/6461).
You can read more about it [here]({% link howto/catalog_exports.md %}).

{% include toc_2-3.html %}

