
Issue when connecting to REST catalogs on AWS (Amazon SageMaker Lakehouse) #1449

Open
1 of 3 tasks
Neuw84 opened this issue Dec 19, 2024 · 5 comments


Neuw84 commented Dec 19, 2024

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

Hi,

I am trying to read from and write to an RMS (Redshift Managed Storage) backed catalog on AWS. I am configuring the REST catalog like this:

rest_catalog = load_catalog(
    "rms-demo.dev",
    **{
        "type": "rest",
        "uri": "https://glue.us-east-2.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": "us-east-2",
    },
)

But the calls to the catalog go through the default account-id catalog. Note the HTTP call in the traceback: the URL should contain "rms-demo.dev", but it uses the account id instead.

Traceback (most recent call last):
  File "/home/ec2-user/my_env/lib64/python3.9/site-packages/pyiceberg/catalog/rest.py", line 599, in _create_table
    response.raise_for_status()
  File "/home/ec2-user/my_env/lib64/python3.9/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://glue.us-east-2.amazonaws.com/iceberg/v1/catalogs/accountId/namespaces/public/tables

The above exception was the direct cause of the following exception:

Is there something behind the scenes that injects the AWS account id instead of using the catalog name? I tried to look in the code, but...

Thanks!!

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

corleyma commented Dec 19, 2024

Unless something has changed since I last looked at this (possible), I don't think the Glue catalog supports the Iceberg REST spec. In any case, there is a separate catalog client implementation in PyIceberg for Glue, and I assume that's what you want to use: https://py.iceberg.apache.org/reference/pyiceberg/catalog/glue/

Edit: Actually, ignore me, I see Redshift has something different going on and supposedly does support REST spec.

Author

Neuw84 commented Dec 20, 2024

Yes, we now have an implementation of the REST catalog.

I know that reading RMS data is not going to be possible yet, but at least I was expecting the HTTP calls to be routed to the expected paths for the catalog.

There should be something in the code causing this (I tried to search but didn't find it).

@kevinjqliu
Contributor

I have not tested this personally, but from reading the AWS blog on connecting Spark to the AWS Glue Iceberg REST catalog, some of the configurations differ from what I would expect.
https://aws.amazon.com/blogs/big-data/read-and-write-s3-iceberg-table-using-aws-glue-iceberg-rest-catalog-from-open-source-apache-spark/

import sys
import os
import time
from pyspark.sql import SparkSession

# Replace <aws_region> with the AWS region name.
# Replace <aws_account_id> with the AWS account ID.

spark = SparkSession.builder.appName('osspark') \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.defaultCatalog', 'spark_catalog') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.spark_catalog.type', 'rest') \
    .config('spark.sql.catalog.spark_catalog.uri', 'https://glue.<aws_region>.amazonaws.com/iceberg') \
    .config('spark.sql.catalog.spark_catalog.warehouse', '<aws_account_id>') \
    .config('spark.sql.catalog.spark_catalog.rest.sigv4-enabled', 'true') \
    .config('spark.sql.catalog.spark_catalog.rest.signing-name', 'glue') \
    .config('spark.sql.catalog.spark_catalog.rest.signing-region', '<aws_region>') \
    .config('spark.sql.catalog.spark_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider') \
    .config('spark.sql.catalog.spark_catalog.rest-metrics-reporting-enabled', 'false') \
    .getOrCreate()

Specifically, notice the warehouse parameter. It might help to replicate what this Spark config is doing.
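
A minimal sketch of what that might look like in PyIceberg, assuming the same convention as the Spark example above (passing the account id as the `warehouse` property, which the REST catalog forwards when fetching its configuration). The `<aws_account_id>` placeholder and the exact value expected by Glue are assumptions, not tested:

```python
# Hypothetical PyIceberg equivalent of the Spark config above.
# "warehouse" here mirrors spark.sql.catalog.spark_catalog.warehouse;
# whether Glue expects the bare account id or a "<account_id>:<catalog>" form
# is an assumption based on the blog post.
catalog_props = {
    "type": "rest",
    "uri": "https://glue.us-east-2.amazonaws.com/iceberg",
    "warehouse": "<aws_account_id>",
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "glue",
    "rest.signing-region": "us-east-2",
}

# Then, with PyIceberg installed and AWS credentials available:
# from pyiceberg.catalog import load_catalog
# rest_catalog = load_catalog("rms-demo.dev", **catalog_props)
```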

Author

Neuw84 commented Dec 20, 2024

Well, for normal S3 buckets it works fine without that parameter (I have a working script that writes and reads via the REST catalog).

I will try to dig into why it is making those odd calls.

@kevinjqliu
Contributor

Can you share what you've tried that worked? It might be helpful for debugging this further.
