
Docs: Glue Exporter and related #6823

Merged: 48 commits into master from 6685-export-hooks-docs on Oct 26, 2023
Conversation

@Isan-Rivkin (Contributor) commented Oct 19, 2023

Closes #6685

Will merge when I have 2 human approvers ✅ ✅

What is covered in the docs

  1. Lua Library Spec
  2. Data catalog exports (high level)
  3. AWS Glue integration with Catalog Exports guide

lakeFS user perspective

A user should read the AWS Glue integration guide to achieve their goal.
From time to time they may jump to the general explanation and the Library Spec to gain a deeper understanding of any open questions they have.

@Isan-Rivkin self-assigned this Oct 19, 2023
@Isan-Rivkin added the export-hooks, include-changelog (PR description should be included in next release changelog), and docs (Improvements or additions to documentation) labels on Oct 19, 2023
@Isan-Rivkin changed the title from "[Docs]: Glue Exporter and related" to "Docs: Glue Exporter and related" on Oct 19, 2023
@github-actions bot commented Oct 19, 2023

♻️ PR Preview a0c0cc9 has been successfully destroyed since this PR has been closed.

🤖 By surge-preview

@Isan-Rivkin requested a review from johnnyaug on October 24, 2023 07:07
@arielshaqed (Contributor) left a comment

Thanks!

This is really neat. I found the use of the passive voice somewhat difficult to read, and I suggest using an active voice in several places. You are of course free to disregard this, or alternatively to find a few other places that continue to use the passive voice.

docs/howto/catalog_exports.md: 4 resolved review comments (outdated)

#### Hive tables

Hive metadata server tables are essentially just a set of objects that share a prefix, with no table metadata stored on the object store. You need to configure prefix, partitions, and schema.
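
For context, such a table is described to the exporter by a descriptor file under `_lakefs_tables/<my_table>.yaml`. A rough sketch of what one might look like, assuming the Hive descriptor fields described in the catalog exports doc; the table name, prefix, and columns below are hypothetical:

```yaml
# Hypothetical _lakefs_tables/animals.yaml describing a Hive-style table:
# a prefix holding the data, its partition columns, and its schema.
name: animals
type: hive
path: tables/animals           # prefix of the table's objects within the repo
partition_columns: ['year']
schema:
  type: struct
  fields:
    - name: year
      type: int
      nullable: false
      metadata: {}
    - name: name
      type: string
      nullable: true
      metadata: {}
```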
Contributor:

IIRC we already define "hive tables" in our docs elsewhere. Can we unify these definitions? (Fine to open an issue to do this in a future PR)

@Isan-Rivkin (Contributor Author) commented Oct 25, 2023

Issue: #6863

You're right, but the current definition is so slim (one sentence and an example) that I prefer not to unify it right now, since it's really a per-case definition.
I'll open an issue because I'd rather have a more mature and organized definition when doing that, not ad-hoc in this PR.


### `aws/glue.get_table(database, table [, catalog_id])`

Get Table from Glue Catalog.
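
For illustration, a minimal sketch of calling this from a Lua hook. It assumes the client is obtained via `aws.glue_client(...)` as elsewhere in the lakeFS Lua library, that `args` is supplied by the triggering action, and that the database/table names are hypothetical:

```lua
-- Sketch only (assumptions noted above): fetch a table's metadata from Glue.
local aws = require("aws")

local glue = aws.glue_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
-- catalog_id is optional and omitted here; the return value is the Glue
-- table metadata, per the library spec.
local table_meta = glue.get_table("my_database", "animals")
```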
Contributor:

Capitalization seems different than elsewhere.

@@ -142,6 +142,55 @@ or:

Deletes all objects under the given prefix

### `aws/glue`

AWS Glue Client.
Contributor:

Capitalization seems different than elsewhere.

docs/integrations/glue_metastore.md: 3 resolved review comments (outdated)
Isan-Rivkin and others added 9 commits October 25, 2023 10:05
Co-authored-by: Ariel Shaqed (Scolnicov) <[email protected]> (×8)
@johnnyaug (Contributor) left a comment

I think the two new pages should be merged into one, under the "How To" category.
It should be a guide teaching the user how to use lakeFS together with the Glue Data Catalog.
My reasoning is that it's not really an integration but more of a cookbook.

(we can add a page in the integration section that will only contain a link to the guide).

I'm leaving my comments on the "glue_hive_metastore.md" file, since I think it's the one we should keep (and like I said, IMO should move to be under How-To).

docs/integrations/glue_metastore.md: 2 resolved review comments (outdated)
Comment on lines 13 to 15
## About Glue Metastore

[AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html) has a metastore that can store metadata related to Hive and other services (such as Spark and Trino). It has metadata such as the location of the table, information about columns, partitions and much more.
Contributor:

Not needed, IMO

Comment on lines 18 to 41
## Support in lakeFS

The integration between Glue and lakeFS is based on [Data Catalog Exports]({% link howto/catalog_exports.md %}).

### What is supported

- Creating a unique table in Glue Catalog per lakeFS repository / ref / commit.
- No data copying is required, the table location is a path to a symlinks structure in S3 based on Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html) and the [table partitions](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#tables-partition) are maintained.
- Tables are described via [Hive format in `_lakefs_tables/<my_table>.yaml`]({% link howto/catalog_exports.md %}#hive-tables).
- Currently, querying data through the Glue metastore is a read-only operation; mutating data requires writing to lakeFS and letting the export hook run.

### How it works

Based on lakeFS events such as `post-commit`, an Action runs a script that creates symlink structures in S3 and then registers a table in Glue.
The table's data location is the generated symlinks root path.

There are 4 key pieces:

1. Table description at `_lakefs_tables/<your-table>.yaml`
2. Lua script that will do the export using [symlink_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportsymlink_exporter) and [glue_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporter) packages.
3. [Action Lua Hook]({% link howto/catalog_exports.md %}#running-an-exporter) to execute the lua hook.
4. Write some lakeFS table data ([Spark]({% link integrations/spark.md %}), CSV, etc)

To learn more check [Data Catalog Exports]({% link howto/catalog_exports.md %}).
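
To make the symlink structure mentioned above concrete, here is a rough, hypothetical sketch of the exported layout; the bucket, export prefix, and partition values are illustrative only and depend on the exporter configuration:

```
# Hypothetical layout of exported symlink files, one per partition:
s3://example-bucket/exports/example-repo/main/animals/year=2023/symlink.txt
s3://example-bucket/exports/example-repo/main/animals/year=2024/symlink.txt

# Each symlink.txt lists the physical locations of that partition's data
# objects, one URI per line, which SymlinkTextInputFormat then reads, e.g.:
s3://example-bucket/storage-namespace/data/objects/abc123
```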
Contributor:

Suggested change (replacing the quoted section above with the following):
This guide will show you how to use lakeFS with the Glue Data Catalog.
You'll be able to query your lakeFS data by specifying the branch, ref, or commit in your SQL query.
Currently, only read operations are supported on the tables.
You will set up the automation required to work with lakeFS on top of the Glue Data Catalog, including:
1. Create a table descriptor under `_lakefs_tables/<your-table>.yaml`. This will represent your table schema.
2. Write an exporter script that will:
* Mirror your branch's state into symlink files readable by Athena
* Export the table descriptors from your branch to the Glue Catalog.
3. Set up lakeFS hooks that will run the above script when specific events occur.

Contributor Author:

Thanks! That's an awesome suggestion, did that with minor changes + added links.

docs/integrations/glue_metastore.md: resolved review comment (outdated)

### Write some table data

Add some table data.
Contributor:

I suggest keeping this section short, even a one-liner if that's possible. It's not in the scope of the article; it's more of a means to an end.
It can be something like "Insert data under the table path, using your preferred method. For example, here is how to write data using Spark: ..."

The commit bit doesn't need to be in the code, you can simply say "then perform a commit in lakeFS".

Contributor:

@johnnyaug is right that this doesn't belong in "integrations". Unfortunately if you remove this text, it means you've just been volunteered to write the blog post / tech note.

@Isan-Rivkin (Contributor Author) commented Oct 26, 2023

@arielshaqed so, setting volunteering aside, you agree that this step-by-step tutorial on how to write the data itself should be replaced with general guidance? :)

Contributor Author:

@johnnyaug I took your advice and removed 99% of the content from there!

Comment on lines 202 to 218
### Add Glue Exporter

The current step requires 2 things:
1. Lua script.
2. Action to trigger the script.

#### Lua packages (Exporters)

- [symlink_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportsymlink_exporter)
- [glue_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporter)

#### Exporter Script

##### 1. Create Lua script:

For the simple strategy of creating a Glue table per repo / branch / commit, we can simply copy-paste the following script and reuse it.
Upload the script to `scripts/animals_exporter.lua` (it could be any path).
Contributor:

Suggested change (replacing the quoted section above with the following):
### The exporter script
Upload the following script to your main branch under `scripts/animals_exporter.lua` (or a path of your choice).
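
For orientation, here is a rough sketch of what such an exporter script could look like. It is an assumption pieced together from the exporter packages referenced above and the `export_glue` call visible further down in this thread; the `args` keys, the `export_s3` entry point, and the table path are illustrative rather than the exact script from the PR:

```lua
--[[
  Sketch only: assumes the lakeFS Lua runtime exposes aws.s3_client and
  aws.glue_client, that symlink_exporter provides an export_s3 entry point,
  and that `args` and `action` are injected by the triggering action.
  The args keys and table path below are hypothetical.
]]
local aws = require("aws")
local symlink_exporter = require("lakefs/catalogexport/symlink_exporter")
local glue_exporter = require("lakefs/catalogexport/glue_exporter")

local table_path = args.table_source  -- e.g. "_lakefs_tables/animals.yaml"

-- 1. Export symlink files describing the table's partitions to S3.
local s3 = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
symlink_exporter.export_s3(s3, table_path, action, {debug=true})

-- 2. Register the table in the Glue Catalog, pointing at the symlinks root.
local glue = aws.glue_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
local res = glue_exporter.export_glue(glue, args.catalog.db_name, table_path, args.catalog.table_input, action, {debug=true})
```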

Contributor Author:

Like it! With a little twist, though: I kept the libraries reference.

local res = glue_exporter.export_glue(glue, db, table_path, table_input, action, {debug=true})
```

##### 2. Configure Action Hooks:
Contributor:

Suggested change
##### 2. Configure Action Hooks:
### Configure Action Hooks
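
For context, such a hook is configured as a lakeFS action file committed under `_lakefs_actions/`. A rough, hypothetical sketch of the wiring, assuming a `type: lua` hook with `script_path` and `args` properties; the hook id, credential placeholders, and args keys are illustrative only:

```yaml
# Hypothetical _lakefs_actions/glue_exporter.yaml triggering the Lua script
# uploaded in the previous step on every commit to main.
name: Glue Table Exporter
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: animals_glue_exporter
    type: lua
    properties:
      script_path: scripts/animals_exporter.lua
      args:
        aws:
          access_key_id: "<AWS_ACCESS_KEY_ID>"
          secret_access_key: "<AWS_SECRET_ACCESS_KEY>"
          region: us-east-1
        # exporter-specific arguments (table source, Glue database, table
        # input) go here as well; see the exporter packages for details
```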

@arielshaqed (Contributor) left a comment

Yoni is right, as always.

Still LGTM: I liked it before, and I am sure I will like it even more with all of Yoni's comments!



@Isan-Rivkin requested a review from johnnyaug on October 26, 2023 09:21

@Isan-Rivkin (Contributor Author) commented Oct 26, 2023

@johnnyaug thanks for the great comments, I took all of them!
One thing I wouldn't want to change is merging the two documents.
The idea is that Catalog Exports is a high-level concept and we are going to add more exporters; there's no reason for it all to live in one document, which would be even more distracting.

@johnnyaug (Contributor) left a comment

Awesome, thank you for considering my comments!

@Isan-Rivkin merged commit 3c59c82 into master on Oct 26, 2023
29 of 30 checks passed
@Isan-Rivkin deleted the 6685-export-hooks-docs branch on October 26, 2023 12:13
Labels: docs, export-hooks, include-changelog
Projects: none
Development: merging this pull request may close "[Docs]: Glue Exporter"
3 participants