Docs: Glue Exporter and related #6823
Conversation
Co-authored-by: Yoni <[email protected]>
Thanks!
This is really neat. I found the use of the passive voice somewhat difficult to read, and I suggest using an active voice in several places. You are of course free to disregard this, or alternatively to find a few other places that continue to use the passive voice.
#### Hive tables

Hive metastore tables are essentially a set of objects that share a prefix, with no table metadata stored on the object store. You need to configure the prefix, partitions, and schema.
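To make the definition concrete, here is a hedged sketch of such a table descriptor. The field layout follows the Hive-table format described in the catalog exports docs; the table name, path, and columns are made up for illustration:

```yaml
# Hypothetical _lakefs_tables/animals.yaml; column names and values
# are illustrative, not taken from this PR.
name: animals
type: hive
path: tables/animals            # the prefix shared by the table's objects
partition_columns: ['type']
schema:
  type: struct
  fields:
    - name: type
      type: string
      nullable: true
    - name: name
      type: string
      nullable: true
```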
IIRC we already define "hive tables" in our docs elsewhere. Can we unify these definitions? (Fine to open an issue to do this in a future PR)
Issue: #6863
You're right, but the current definition is so slim (one sentence and an example) that I prefer not to unify them now, since it's really a per-case definition.
I'll open an issue, because I prefer to have a more mature and organized definition when doing that, rather than ad hoc in this PR.
docs/howto/hooks/lua.md
Outdated
### `aws/glue.get_table(database, table [, catalog_id])`

Get Table from Glue Catalog.
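Given the signature above, a call might look like the following. This is a hypothetical usage sketch: the client construction mirrors the `aws` client pattern used elsewhere in these docs and should be checked against the `aws/glue` reference.

```lua
-- Hypothetical usage sketch; verify the client constructor against
-- the lakeFS Lua library reference before relying on it.
local aws = require("aws")
local glue = aws.glue_client(access_key_id, secret_access_key, region)
local tbl = glue.get_table("my_database", "animals")
```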
Capitalization seems different than elsewhere.
docs/howto/hooks/lua.md
Outdated
@@ -142,6 +142,55 @@ or:

Deletes all objects under the given prefix.

### `aws/glue`

AWS Glue Client.
Capitalization seems different than elsewhere.
Co-authored-by: Ariel Shaqed (Scolnicov) <[email protected]>
I think the two new pages should be merged into one, under the "How To" category.
It should be a guide teaching the user how to use lakeFS together with the Glue Data Catalog.
My reasoning is that it's not really an integration but more of a cookbook.
(We can add a page in the integrations section that only contains a link to the guide.)
I'm leaving my comments on the "glue_hive_metastore.md" file, since I think it's the one we should keep (and, as I said, IMO it should move under How-To).
docs/integrations/glue_metastore.md
Outdated
## About Glue Metastore

[AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html) has a metastore that can store metadata related to Hive and other services (such as Spark and Trino): the location of the table, information about columns, partitions, and much more.
Not needed, IMO
docs/integrations/glue_metastore.md
Outdated
## Support in lakeFS

The integration between Glue and lakeFS is based on [Data Catalog Exports]({% link howto/catalog_exports.md %}).

### What is supported

- Creating a unique table in the Glue Catalog per lakeFS repository / ref / commit.
- No data copying is required: the table location is a path to a symlink structure in S3 based on Hive's [SymlinkTextInputFormat](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.html), and the [table partitions](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#tables-partition) are maintained.
- Tables are described via [Hive format in `_lakefs_tables/<my_table>.yaml`]({% link howto/catalog_exports.md %}#hive-tables).
- Currently, querying data through the Glue metastore is a read-only operation; mutating data requires writing to lakeFS and letting the export hook run.

### How it works

Based on lakeFS events such as `post-commit`, an Action runs a script that creates symlink structures in S3 and then registers a table in Glue.
The table's data location is the generated symlinks root path.

There are 4 key pieces:

1. A table description at `_lakefs_tables/<your-table>.yaml`.
2. A Lua script that performs the export using the [symlink_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportsymlink_exporter) and [glue_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporter) packages.
3. An [Action Lua Hook]({% link howto/catalog_exports.md %}#running-an-exporter) to execute the Lua script.
4. Some lakeFS table data ([Spark]({% link integrations/spark.md %}), CSV, etc.).

To learn more, check [Data Catalog Exports]({% link howto/catalog_exports.md %}).
This guide will show you how to use lakeFS with the Glue Data Catalog.
You'll be able to query your lakeFS data by specifying the branch, ref, or commit in your SQL query.
Currently, only read operations are supported on the tables.

You will set up the automation required to work with lakeFS on top of the Glue Data Catalog, including:

1. Create a table descriptor under `_lakefs_tables/<your-table>.yaml`. This will represent your table schema.
2. Write an exporter script that will:
   * Mirror your branch's state into symlink files readable by Athena
   * Export the table descriptors from your branch to the Glue Catalog.
3. Set up lakeFS hooks that will run the above script when specific events occur.
Thanks! That's an awesome suggestion; I did that with minor changes and added links.
docs/integrations/glue_metastore.md
Outdated
### Write some table data

Add some table data.
I suggest keeping this section short, even a one-liner if that's possible. It's not in the scope of the article; it's more a means to an end.
It can be something like "Insert data under the table path, using your preferred method. For example, here is how to write data using Spark: ..."
The commit bit doesn't need to be in the code; you can simply say "then perform a commit in lakeFS".
@itaiad200 @arielshaqed opinions?
@johnnyaug is right that this doesn't belong in "integrations". Unfortunately if you remove this text, it means you've just been volunteered to write the blog post / tech note.
@arielshaqed so, setting volunteering aside, you agree that this step-by-step tutorial on how to write the data itself should be replaced with general guidance? :)
@johnnyaug I took your advice and removed 99% of the content from there!
docs/integrations/glue_metastore.md
Outdated
### Add Glue Exporter

The current step requires 2 things:

1. A Lua script.
2. An Action to trigger the script.

#### Lua packages (Exporters)

- [symlink_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportsymlink_exporter)
- [glue_exporter]({% link howto/hooks/lua.md %}#lakefscatalogexportglue_exporter)

#### Exporter Script

##### 1. Create Lua script:

For the simple strategy of creating a Glue table per repo / branch / commit, we can simply copy-paste the following script and reuse it.
Upload the script to `scripts/animals_exporter.lua` (it could be any path).
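Assembled from the two exporter packages referenced above and the `export_glue` call quoted elsewhere in this review, a sketch of what such a script might look like. The `args` field names (`table_source`, `catalog.db_name`, `catalog.table_input`) and the client constructors are assumptions to be checked against the lakeFS Lua hooks reference:

```lua
-- Sketch of scripts/animals_exporter.lua; the `args` layout and client
-- constructors are assumptions, not verified API.
local aws = require("aws")
local symlink_exporter = require("lakefs/catalogexport/symlink_exporter")
local glue_exporter = require("lakefs/catalogexport/glue_exporter")

-- Path of the table descriptor under _lakefs_tables/
local table_path = args.table_source

-- Mirror the branch's state into symlink files in S3
local s3 = aws.s3_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
symlink_exporter.export_s3(s3, table_path, action, {debug=true})

-- Register the table in the Glue Catalog, pointing at the symlinks root
local glue = aws.glue_client(args.aws.access_key_id, args.aws.secret_access_key, args.aws.region)
local res = glue_exporter.export_glue(glue, args.catalog.db_name, table_path, args.catalog.table_input, action, {debug=true})
```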
### The exporter script

Upload the following script to your main branch under `scripts/animals_exporter.lua` (or a path of your choice).
Like it! But with a little twist: I kept the libraries reference.
docs/integrations/glue_metastore.md
Outdated
local res = glue_exporter.export_glue(glue, db, table_path, table_input, action, {debug=true})
```

##### 2. Configure Action Hooks:
### Configure Action Hooks
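For context, a `post-commit` action that triggers the exporter script might look like the following. The general layout follows the lakeFS actions YAML schema; the hook id, branch, and the entire `args` block are assumptions mirroring what the script would read:

```yaml
# Hypothetical _lakefs_actions/glue_exporter.yaml; the `args` layout is
# an assumption matching the exporter script sketch, not verified schema.
name: Glue Table Exporter
on:
  post-commit:
    branches: ["main"]
hooks:
  - id: animals_glue_exporter
    type: lua
    properties:
      script_path: scripts/animals_exporter.lua
      args:
        table_source: tables/animals
        aws:
          access_key_id: <AWS_ACCESS_KEY_ID>
          secret_access_key: <AWS_SECRET_ACCESS_KEY>
          region: us-east-1
        catalog:
          db_name: my-glue-db
          table_input: {}   # Glue TableInput fields go here
```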
Co-authored-by: Yoni <[email protected]>
Yoni is right, as always.
Still LGTM: I liked it before, and I am sure I will like it even more with all of Yoni's comments!
docs/integrations/glue_metastore.md
Outdated
### Write some table data

Add some table data.
@johnnyaug thanks for the great comments, I took all of them!
Awesome, thank you for considering my comments!
Closes #6685
Will merge when I have 2 human approvers ✅ ✅
What is covered in the docs

From the lakeFS user's point of view:
A user should read the AWS Glue integration to achieve their goal.
From time to time they should jump to the general explanation and library spec to get a deeper understanding of any open questions they might have.