feat: AWS glue catalog support for iceberg_scan() #51

rustyconover · 2024-04-11T01:04:50Z

Add support for accessing tables stored at AWS Glue.

Example SQL call:

select * from iceberg_scan('{ "catalog_type": "glue", "region": "us-east-1", "database_name": "test_iceberg", "table_name": "users"}');

Added the framework for additional external Iceberg catalogs.

The JSON object should have this format:

{
  "catalog_type": "glue",
  "catalog": "1234567890",          // optional - the catalog to use
  "region": "us-east-1",            // required - change to the right region
  "database_name": "test_iceberg",  // required - change to your database
  "table_name": "table_name"        // required - change for each table.
}

Many comparisons were performed using `yyjson_get_tag()` rather than `yyjson_get_type()`. The tag can carry additional information in bits beyond the type, which caused these type comparisons to fail and the JSON to fail to parse.

fix: fix the extension to build with current duckdb main branch. Fix a few std::move() calls and a call to fs.OpenFile().
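To illustrate the tag-versus-type issue described above, here is a minimal, self-contained sketch. It does not include the real yyjson header; the constants and helper names (`kTypeMask`, `MakeTag`, `GetType`, `IsBoolByTag`, `IsBoolByType`) are hypothetical stand-ins, and the bit layout (type in the low 3 bits, subtype in the next 2) is an assumption modeled on how yyjson describes its tags.

```cpp
#include <cstdint>

// Simplified stand-ins for yyjson's tag layout, NOT the real yyjson header.
// Assumption: the low 3 bits of a tag hold the type, the next 2 bits the subtype.
constexpr std::uint8_t kTypeMask = 0x07;
constexpr std::uint8_t kTypeBool = 3;
constexpr std::uint8_t kSubtypeFalse = 0 << 3;
constexpr std::uint8_t kSubtypeTrue = 1 << 3;

// Combine a type and a subtype into one tag byte.
constexpr std::uint8_t MakeTag(std::uint8_t type, std::uint8_t subtype) {
    return static_cast<std::uint8_t>(type | subtype);
}

// Models what a "get type" accessor does: mask off everything but the type bits.
constexpr std::uint8_t GetType(std::uint8_t tag) {
    return static_cast<std::uint8_t>(tag & kTypeMask);
}

// Fragile check: compares the whole tag (type plus subtype) against a bare type.
constexpr bool IsBoolByTag(std::uint8_t tag) { return tag == kTypeBool; }

// Robust check: compares only the type bits.
constexpr bool IsBoolByType(std::uint8_t tag) { return GetType(tag) == kTypeBool; }

// `false` has subtype 0, so the fragile check happens to pass...
static_assert(IsBoolByTag(MakeTag(kTypeBool, kSubtypeFalse)), "false matches by tag");
// ...but `true` sets a subtype bit, so the same check rejects a valid bool.
static_assert(!IsBoolByTag(MakeTag(kTypeBool, kSubtypeTrue)), "true does not match by tag");
// Masking to the type bits accepts both values.
static_assert(IsBoolByType(MakeTag(kTypeBool, kSubtypeTrue)), "true matches by type");
static_assert(IsBoolByType(MakeTag(kTypeBool, kSubtypeFalse)), "false matches by type");
```

The point of the sketch is only that an equality test on the full tag is correct by accident for values whose subtype bits are zero, which is why such comparisons can start failing when the set of subtype bits in use changes.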

samansmink · 2024-04-11T07:49:01Z

Hey @rustyconover! Thanks a lot for the PRs!

To review this I will need to set up an AWS Glue table myself to test it out. I will try to find some time tomorrow to do this.

One small comment I do have already is that I'm not sure the JSON string is the neatest way of passing the configuration to the Iceberg scan function. Maybe we can instead just add all of them as named_parameters to the iceberg table function. I think many of these will be shared among catalog_types anyway, and that way the parser will help give meaningful error messages and syntax highlighting of the SQL strings works better.
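As a rough illustration of why named parameters can give better error messages than an opaque JSON string, here is a self-contained sketch. It is not DuckDB's binder or the extension's code: `CheckCatalogParameters` is a hypothetical helper, the parameter names are taken from the examples in this thread, and treating `catalog_type` as required is an assumption.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model of named-parameter checking; not DuckDB's actual binder.
// Returns an empty string when the parameters are acceptable, otherwise a
// message naming the offending parameter.
inline std::string CheckCatalogParameters(const std::map<std::string, std::string> &given) {
    // Names taken from the examples in this thread.
    static const std::set<std::string> allowed = {"catalog_type", "catalog", "region", "database_name"};
    // Assumption: catalog_type is required alongside region and database_name.
    static const std::set<std::string> required = {"catalog_type", "region", "database_name"};

    // Reject any name that is not declared, so a typo is reported by name.
    for (const auto &entry : given) {
        if (allowed.count(entry.first) == 0) {
            return "unknown named parameter \"" + entry.first + "\"";
        }
    }
    // Report the first declared-as-required name that was not supplied.
    for (const auto &name : required) {
        if (given.count(name) == 0) {
            return "missing required named parameter \"" + name + "\"";
        }
    }
    return "";
}
```

Because the set of names is declared up front, a misspelling such as `regoin` can be reported by name before any catalog request is made, whereas a JSON string is only inspected once the function parses it at runtime.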

rustyconover · 2024-04-11T14:34:03Z

Hi @samansmink,

I'll look at changing to named parameters and post a revised PR.

Rusty

fix: add support for iceberg_metadata function. Change a lot of static functions around so that the configuration for the catalog information can be easily passed around.

rustyconover · 2024-04-11T23:16:27Z

Hi @samansmink,

I've changed things around to use named parameters and added the support so that the iceberg_metadata() function can also use the same configuration.

Rusty

rustyconover · 2024-04-11T23:17:24Z

You can now run queries that look like this:

select * from iceberg_scan('users', catalog_type="glue", region="us-east-1", database_name="test_iceberg");

select * from iceberg_metadata('users', catalog_type="glue", region="us-east-1", database_name="test_iceberg");

harel-e · 2024-04-12T05:30:49Z

@rustyconover - Thank you for this PR and #50.
I have access to Iceberg tables on AWS Glue and can help test this feature.
Is it possible to provide a binary or Docker image for this PR? I'm having issues building DuckDB locally.
If the binary contains #50, I can test that one as well.

rustyconover · 2024-04-12T13:10:20Z

Hi @harel-e,

Thank you for your kind words.

Unfortunately I can't help you build the extension or package it as a Docker container. You might want to try asking on the DuckDB discord for help building DuckDB.

I'm building it on Mac OS X. I had to make some changes to vcpkg to work around the fallout from the xz package being unavailable, which affected boost.

Rusty

samansmink · 2024-04-15T19:50:53Z

vcpkg should be restored again from the xz debacle afaik! Check out https://github.com/duckdb/extension-template for some instructions on setting up vcpkg for extension builds.

harel-e · 2024-05-07T15:24:49Z

I tested this branch on AWS with several Iceberg tables.

This query pattern works fine:
select * from iceberg_scan('users', catalog_type="glue", region="us-east-1", database_name="test_iceberg");

Hoping to see it in the upcoming 0.10.3.

Thank you @rustyconover for this wonderful addition. DuckDB is now one step closer to working seamlessly in AWS.

samansmink · 2024-05-07T15:49:10Z

Sorry for the absence here, I've been really busy.

There are still some CI problems remaining here on Windows and Linux amd64; those would need to be fixed for this to get merged before 0.10.3.

rustyconover · 2024-05-07T19:28:17Z

I'll take a look at the Linux build failures, but I can't look at the Windows ones since I don't have access to that platform.

arnabneogi86 · 2024-06-19T09:02:32Z

@rustyconover: Does this support the Nessie catalog for Iceberg?

janosszendivarga · 2024-08-12T12:22:47Z

Any chance of getting this PR merged?

szalai1 · 2024-08-27T13:20:17Z

@rustyconover are you still working on this? would it make sense for someone to pick this up?

rustyconover · 2024-08-27T13:22:18Z

I'm not actively working on this PR, feel free to finish it up.

jyorko · 2024-10-19T02:02:37Z

We need catalogue support

panga · 2024-12-19T17:01:13Z

I think it would be better to add an ICEBERG catalog secret type instead of using named parameters/JSON.

That way you could just query and not worry about configuring it every time.

Also, the same secret can be used for other catalog types (e.g. REST).

rustyconover added 3 commits April 10, 2024 07:43

fix: update yyjson vendor dependency

700594f

feature: add support for AWS Glue catalog

dc47949

This was referenced Apr 11, 2024

Support AWS Glue Catalog #1

Closed

Iceberg REST Catalog Support #16

Open

fix: change to using named parameters.

e75d17b

fix: add support for iceberg_metadata function. Change a lot of static functions around so that the configuration for the catalog information can be easily passed around.

fix: remove docs about data catalog usage

02be67b

Merge branch 'main' into feature_glue_catalog_support

0669e19

philippemnoel mentioned this pull request Aug 28, 2024

Add support for Catalog Providers paradedb/pg_analytics#108

Open
