
Support for S3 catalog to work with S3 Tables #1404

Open
nicor88 opened this issue Dec 5, 2024 · 11 comments



nicor88 commented Dec 5, 2024

Feature Request / Improvement

Amazon S3 Tables has been launched (see this announcement), and it looks like S3 Tables comes with a managed Iceberg catalog.

Based on https://github.com/awslabs/s3-tables-catalog, it looks like AWS built an S3 catalog wrapper in Java that can be used by query engines like Spark/Trino.
It would be valuable to be able to write to S3 Tables via pyiceberg.

More context

Based on my understanding, when an S3 table is created, its Iceberg metadata is not initialized.
For a freshly created table, it's possible to retrieve the warehouseLocation -> see get_table.
The warehouseLocation looks like a dedicated S3 bucket where you can put S3 objects.
After putting the S3 objects of an Iceberg commit operation (data + metadata), it's possible to use update_table_metadata_location to point the S3 table to the right location.

Note: I'm not 100% sure about the above - I still need to validate it with some tests.
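The workflow sketched above can be expressed against the S3 Tables control-plane API via boto3 (client name "s3tables", operations GetTable / UpdateTableMetadataLocation). This is a hedged sketch, not the pyiceberg implementation: the ARN, namespace, and table name are placeholders, and the exact request/response fields should be validated against the S3 Tables API reference.

```python
def register_metadata(table_bucket_arn: str, namespace: str, name: str,
                      metadata_location: str) -> None:
    """Point an S3 table at freshly written Iceberg metadata (sketch)."""
    import boto3  # imported lazily so the module loads without boto3 installed

    s3tables = boto3.client("s3tables")

    # 1. Fetch the table; the response carries the warehouseLocation (the
    #    bucket-like prefix where data/metadata objects go) and a versionToken.
    table = s3tables.get_table(
        tableBucketARN=table_bucket_arn, namespace=namespace, name=name
    )

    # 2. After uploading the data + metadata objects of an Iceberg commit under
    #    table["warehouseLocation"], register the new metadata file so the
    #    catalog points at it.
    s3tables.update_table_metadata_location(
        tableBucketARN=table_bucket_arn,
        namespace=namespace,
        name=name,
        versionToken=table["versionToken"],
        metadataLocation=metadata_location,
    )
```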

@kevinjqliu (Contributor)

Thanks for raising this @nicor88! Would you be interested in contributing this feature?



nlm4145 commented Dec 6, 2024

I also would be interested in this feature.


nicor88 commented Dec 9, 2024

@kevinjqliu Unfortunately I don't have the capacity at the moment to contribute to this feature.
I would nevertheless be available to look at the PR and test the implementation.

@felixscherz (Contributor)

I'm also interested. I will have a look at the reference @nicor88 provided and create a PR if I can get something to work :)

@petehanssens

Super keen to see this happen too!


nicor88 commented Dec 12, 2024

It looks like the warehouse location of these S3 tables doesn't support List operations.
I tried pointing my local warehouse (using SQLite) at the warehouse location of an S3 table, just to validate that everything could work, and got this error:

AWS Error UNKNOWN (HTTP status 405) during ListObjectsV2 operation: Unable to parse ExceptionName: MethodNotAllowed Message: The specified method is not allowed against this resource.

The issue seems to come from the pyarrow-based FileIO, which performs this check:

if not overwrite and self.exists() is True:
    raise FileExistsError(f"Cannot create file, already exists: {self.location}")
output_file = self._filesystem.open_output_stream(self._path, buffer_size=self._buffer_size)

The self.exists() call triggers a list operation under the hood, which is not supported.
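The interaction between the overwrite flag and the existence probe can be reproduced in miniature against a local filesystem. This is plain stdlib, not the actual pyiceberg/pyarrow classes: the point is that with overwrite=False the writer must first ask the store whether the object exists, and it is exactly that probe (a ListObjectsV2 under the hood) that the S3 Tables warehouse location rejects with HTTP 405.

```python
import tempfile
from pathlib import Path

def create_file(path: Path, data: bytes, overwrite: bool = False) -> None:
    # Mirrors the guard quoted above: with overwrite=False the writer probes
    # for existence before writing; overwrite=True skips the probe entirely.
    if not overwrite and path.exists():
        raise FileExistsError(f"Cannot create file, already exists: {path}")
    path.write_bytes(data)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metadata.json"
    create_file(target, b"v1")
    try:
        create_file(target, b"v2")  # second create without overwrite fails
    except FileExistsError:
        pass
    create_file(target, b"v2", overwrite=True)  # overwrite=True bypasses the check
    assert target.read_bytes() == b"v2"
```

Setting overwrite=True is therefore a workaround rather than a fix: it avoids the unsupported list call but also loses the protection against clobbering an existing object.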


felixscherz commented Dec 14, 2024

I created an initial PR #1429 where I am currently working on supporting table creation. I ran into the same issue that @nicor88 described and could work around it by setting overwrite=True for now.
However, I now get a different error during the write operation for the table metadata:

AWS Error UNKNOWN (HTTP status 400) during CompleteMultipartUpload operation: Unable to parse ExceptionName: S3TablesUnsupportedHeader Message: S3 Tables does not support the following header: x-amz-api-version value: 2006-03-01

I'm currently going through the pyarrow S3FileSystem implementation to see where this header is being introduced.

EDIT:

I tried using a different FileIO, and the issue disappears when using pyiceberg.io.fsspec.FsspecFileIO explicitly via:

properties = {"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}

It seems this is indeed specific to pyarrow.
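The workaround above is just a catalog property. A minimal sketch of how it would be passed, assuming a hypothetical catalog name (the "py-io-impl" key and the FsspecFileIO class path come from the comment itself):

```python
# Force pyiceberg to use the fsspec-based FileIO instead of the pyarrow one.
properties = {
    "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
}

# Typical usage (requires pyiceberg and a reachable catalog, so shown only
# as a comment; "my_catalog" is a placeholder):
#
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("my_catalog", **properties)
```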

@jamesbornholt

@felixscherz thanks for catching this (and thanks to everyone who's interested in building S3 Tables support for PyIceberg!). We're working on an S3-side fix for the x-amz-api-version exception you're seeing; hoping to have that out soon.


buremba commented Dec 17, 2024

@jamesbornholt Great to hear that you folks are keeping an eye on this!
Sorry if this is not the right channel for the question, but considering S3 Tables is a mix of catalog + storage layer, is there any plan to provide Iceberg REST catalog compatibility as part of S3 Tables, in addition to the current API?

IMO that would accelerate adoption a lot; otherwise every Iceberg implementation will need to integrate with S3 Tables separately, and I have a feeling that maintenance will be non-trivial.

@soumilshah1995

+1
