
Issue when connecting to REST catalogs on AWS (Amazon SageMaker Lakehouse) #1449

Open
1 of 3 tasks
Neuw84 opened this issue Dec 19, 2024 · 5 comments


Neuw84 commented Dec 19, 2024

Apache Iceberg version

0.8.1 (latest release)

Please describe the bug 🐞

Hi,

I am trying to read from and write to an RMS (Redshift Managed Storage) backed catalog on AWS. I am configuring the REST catalog like this:

rest_catalog = load_catalog(
    "rms-demo.dev",
    **{
        "type": "rest",
        "uri": "https://glue.us-east-2.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": "us-east-2",
    },
)

But the calls to the catalog go through the default account-id catalog. Note the HTTP call in the traceback: the URL should contain "rms-demo.dev", but it uses the account id instead.

Traceback (most recent call last):
  File "/home/ec2-user/my_env/lib64/python3.9/site-packages/pyiceberg/catalog/rest.py", line 599, in _create_table
    response.raise_for_status()
  File "/home/ec2-user/my_env/lib64/python3.9/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://glue.us-east-2.amazonaws.com/iceberg/v1/catalogs/accountId/namespaces/public/tables

The above exception was the direct cause of the following exception:

Is there something behind the scenes that injects the AWS account id instead of using the catalog name? I tried to look in the code, but...

Thanks!!

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

corleyma commented Dec 19, 2024

Unless something has changed since I last looked at this (possible), I don't think the Glue catalog supports the Iceberg REST spec. In any case, there is a separate catalog client implementation in PyIceberg for Glue, and I assume that's what you want to use: https://py.iceberg.apache.org/reference/pyiceberg/catalog/glue/

Edit: Actually, ignore me, I see Redshift has something different going on and supposedly does support REST spec.

Author

Neuw84 commented Dec 20, 2024

Yes, we now have an implementation of the REST catalog.

I know that reading RMS data is not going to be possible yet, but at least I was expecting the HTTP calls to be routed to the expected paths for the catalog.

There should be something in the code causing this (I tried to search but didn't find it).

@kevinjqliu
Contributor

I have not tested this personally, but from reading the AWS blog on connecting Spark to the AWS Glue Iceberg REST catalog, some of the configurations differ from what I would expect.
https://aws.amazon.com/blogs/big-data/read-and-write-s3-iceberg-table-using-aws-glue-iceberg-rest-catalog-from-open-source-apache-spark/

import sys
import os
import time
from pyspark.sql import SparkSession

# Replace <aws_region> with the AWS region name.
# Replace <aws_account_id> with the AWS account ID.

spark = SparkSession.builder.appName('osspark') \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.defaultCatalog', 'spark_catalog') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.spark_catalog.type', 'rest') \
    .config('spark.sql.catalog.spark_catalog.uri', 'https://glue.<aws_region>.amazonaws.com/iceberg') \
    .config('spark.sql.catalog.spark_catalog.warehouse', '<aws_account_id>') \
    .config('spark.sql.catalog.spark_catalog.rest.sigv4-enabled', 'true') \
    .config('spark.sql.catalog.spark_catalog.rest.signing-name', 'glue') \
    .config('spark.sql.catalog.spark_catalog.rest.signing-region', '<aws_region>') \
    .config('spark.sql.catalog.spark_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider') \
    .config('spark.sql.catalog.spark_catalog.rest-metrics-reporting-enabled', 'false') \
    .getOrCreate()

Specifically, notice the warehouse parameter. It might help to replicate what this Spark config is doing.
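
A minimal sketch of what that might look like in PyIceberg, assuming the same convention as the Spark example above (passing the account id as the `warehouse` property, which the REST catalog forwards when fetching its configuration). The `<aws_account_id>` placeholder and the exact value expected by Glue are assumptions, not tested:

```python
# Hypothetical PyIceberg equivalent of the Spark config above.
# "warehouse" here mirrors spark.sql.catalog.spark_catalog.warehouse;
# whether Glue expects the bare account id or a "<account_id>:<catalog>" form
# is an assumption based on the blog post.
catalog_props = {
    "type": "rest",
    "uri": "https://glue.us-east-2.amazonaws.com/iceberg",
    "warehouse": "<aws_account_id>",
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "glue",
    "rest.signing-region": "us-east-2",
}

# Then, with PyIceberg installed and AWS credentials available:
# from pyiceberg.catalog import load_catalog
# rest_catalog = load_catalog("rms-demo.dev", **catalog_props)
```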

Author

Neuw84 commented Dec 20, 2024

Well, for normal S3 buckets it works fine without that parameter (I have a working script that writes and reads via the REST catalog).

I will try to dig into why it is making those odd calls.

@kevinjqliu
Contributor

Can you share what you've tried that worked? It might be helpful for debugging this further.
