feat: AWS glue catalog support for iceberg_scan() #51

Open
wants to merge 6 commits into
base: main

rustyconover
Contributor

Add support for accessing Iceberg tables cataloged in AWS Glue.

Example SQL call:

select * from iceberg_scan('{ "catalog_type": "glue", "region": "us-east-1", "database_name": "test_iceberg", "table_name": "users"}');

Added a framework for supporting additional external Iceberg catalogs.

The JSON object passed to iceberg_scan() should have this format:

{
  "catalog_type": "glue",
  "catalog": "1234567890",          // optional - the catalog to use
  "region": "us-east-1",            // required - change to the right region
  "database_name": "test_iceberg",  // required - change to your database
  "table_name": "table_name"        // required - change for each table.
}
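
As a rough illustration of the required and optional fields listed above, the following Python sketch checks a configuration string before it would be handed to `iceberg_scan()`. The helper name `validate_catalog_config` and its error messages are hypothetical and not part of the extension; only the field names come from the format above. The `//` annotations in the format above are not valid JSON and would need to be removed first.

```python
import json

# Field names are taken from the format above; everything else is illustrative.
REQUIRED_FIELDS = ("catalog_type", "region", "database_name", "table_name")
OPTIONAL_FIELDS = ("catalog",)


def validate_catalog_config(raw: str) -> dict:
    """Parse a JSON config string and check its fields (hypothetical helper)."""
    config = json.loads(raw)
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")

    # Report every missing required field at once.
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))

    # Assumption: every known field is expected to be a string.
    for name in REQUIRED_FIELDS + OPTIONAL_FIELDS:
        if name in config and not isinstance(config[name], str):
            raise ValueError(f"field {name!r} must be a string")
    return config
```

A call with only `catalog_type` set would raise a `ValueError` naming `region`, `database_name`, and `table_name`.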

Many comparisons were performed using `yyjson_get_tag()` rather
than `yyjson_get_type()`.  The tag can carry additional information
in bits beyond just the type, which caused these type comparisons
to fail and, in turn, JSON parsing to fail.
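
To make the failure mode concrete, here is a small Python sketch of the bit arithmetic. The constants are assumptions modeled on yyjson's bit layout (low 3 bits for the type, the next 2 bits for the subtype), and `get_tag` and `get_type` are stand-ins for the C functions, not bindings to them.

```python
# Assumed layout, modeled on yyjson: the low 3 bits of a value's tag hold
# the type and the next 2 bits hold the subtype.
TYPE_MASK = 0x07
TAG_MASK = 0xFF

TYPE_BOOL = 3
TYPE_NUM = 4
SUBTYPE_TRUE = 1 << 3
SUBTYPE_SINT = 1 << 3


def get_tag(raw_tag: int) -> int:
    """Stand-in for yyjson_get_tag(): type and subtype bits together."""
    return raw_tag & TAG_MASK


def get_type(raw_tag: int) -> int:
    """Stand-in for yyjson_get_type(): type bits only."""
    return raw_tag & TYPE_MASK


true_value = TYPE_BOOL | SUBTYPE_TRUE  # a JSON `true`
signed_int = TYPE_NUM | SUBTYPE_SINT   # a signed integer

# Comparing the tag against a bare type constant fails once subtype bits are set.
print(get_tag(true_value) == TYPE_BOOL)   # False
print(get_type(true_value) == TYPE_BOOL)  # True
```

Under these assumptions, a check written against the tag only matches values whose subtype bits happen to be zero, which is why the comparisons were switched to the type.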

fix: fix the extension to build with current duckdb main branch.

Fix a few std::move() calls and a call to fs.OpenFile().
This was referenced Apr 11, 2024
@samansmink
Collaborator

Hey @rustyconover! Thanks a lot for the PRs!

To review this I will need to set up an AWS Glue table myself to test it out. I will try to find some time tomorrow to do this.

One small comment I already have is that I'm not sure a JSON string is the neatest way of passing the configuration to the Iceberg scan function. Maybe we can instead add all of them as named_parameters to the iceberg table function. I think many of these will be shared among catalog_types anyway, and that way the parser will help give meaningful error messages and syntax highlighting of the SQL strings will work better.

@rustyconover
Contributor Author

Hi @samansmink,

I'll look at changing to named parameters and post a revised PR.

Rusty

fix: add support for iceberg_metadata function.

Change a lot of static functions around so that the configuration
for the catalog information can be easily passed around.
@rustyconover
Contributor Author

Hi @samansmink,

I've switched to named parameters and added support so that the iceberg_metadata() function can use the same configuration.

Rusty

@rustyconover
Contributor Author

You can now run queries that look like this:

select * from iceberg_scan('users', catalog_type="glue", region="us-east-1", database_name="test_iceberg");

select * from iceberg_metadata('users', catalog_type="glue", region="us-east-1", database_name="test_iceberg");
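
For anyone moving from the earlier JSON form, the mapping to named parameters can be sketched in Python as follows. `to_named_parameter_call` is a hypothetical helper, not part of the extension; it only builds the SQL text in the style of the queries above and does not execute anything.

```python
import json


def to_named_parameter_call(raw: str, function: str = "iceberg_scan") -> str:
    """Build a named-parameter call from the old JSON config (hypothetical helper)."""
    config = json.loads(raw)
    # The table name becomes the positional argument; the rest become named parameters.
    table_name = config.pop("table_name")
    # Values are quoted the same way as in the queries above. No escaping is
    # done, so this is only suitable for trusted, simple values.
    params = ", ".join(f'{key}="{value}"' for key, value in config.items())
    return f"select * from {function}('{table_name}', {params});"


old_config = (
    '{"catalog_type": "glue", "region": "us-east-1", '
    '"database_name": "test_iceberg", "table_name": "users"}'
)
print(to_named_parameter_call(old_config))
# select * from iceberg_scan('users', catalog_type="glue", region="us-east-1", database_name="test_iceberg");
```

Passing `function="iceberg_metadata"` produces the second query form in the same way.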

@harel-e

harel-e commented Apr 12, 2024

@rustyconover - Thank you for this PR and #50.
I have access to Iceberg tables on AWS Glue and can help test this feature.
Is it possible to provide a binary or Docker image for this PR? I'm having issues building DuckDB locally.
If the binary contains #50, I can test that one as well.

@rustyconover
Contributor Author

Hi @harel-e,

Thank you for your kind words.

Unfortunately I can't help you build the extension or package it as a Docker container. You might want to try asking on the DuckDB discord for help building DuckDB.

I'm building it on Mac OS X. I had to make some changes to vcpkg to work around the fallout from the xz package being unavailable for boost.

Rusty

@samansmink
Collaborator

vcpkg should be restored again from the xz debacle afaik! Check out https://github.com/duckdb/extension-template for some instructions on setting up vcpkg for extension builds.

@harel-e

harel-e commented May 7, 2024

I tested this branch on AWS with several Iceberg tables.

This query pattern works fine:
select * from iceberg_scan('users', catalog_type="glue", region="us-east-1", database_name="test_iceberg");

Hoping to see it in the upcoming 0.10.3.

Thank you @rustyconover for this wonderful addition. DuckDB is now one step closer to working seamlessly on AWS.

@samansmink
Collaborator

Sorry for the absence here, I've been really busy.

There are still some CI problems on Windows and Linux amd64; those would need to be fixed for this to get merged before 0.10.3.

@rustyconover
Contributor Author

I'll take a look at the Linux build failures, but I can't look into the Windows ones because I don't have access to that platform.

@arnabneogi86

@rustyconover : Does this support Nessie catalog for iceberg?

@janosszendivarga

Any chance of getting this PR merged?

@szalai1

szalai1 commented Aug 27, 2024

@rustyconover are you still working on this? would it make sense for someone to pick this up?

@rustyconover
Contributor Author

I'm not actively working on this PR, feel free to finish it up.

@jyorko

jyorko commented Oct 19, 2024

We need catalogue support

@panga

panga commented Dec 19, 2024

I think it would be better to add an ICEBERG catalog secret type instead of using named parameters/JSON.

That way you could just query and not worry about configuring it every time.

Also, the same secret could be used for other catalog types (e.g., REST).

8 participants