Do not deprecate Botocore Session in upcoming release (0.8) #1104

Closed
BTheunissen opened this issue Aug 27, 2024 · 11 comments · Fixed by #1300 · May be fixed by #1299

Comments

@BTheunissen

BTheunissen commented Aug 27, 2024

Feature Request / Improvement

The AWS parameter botocore_session has been flagged as deprecated as of #922, and is due to be removed at Milestone 0.8.

I'd like to request that this parameter not be deprecated, and I'd be happy to open a PR to bring the credential name in line with the rest of the updated client configuration. Being able to override botocore_session is useful because it supports automatically refreshable credentials for long-running jobs.

For example in my project I have the following boto3 utility code:

from boto3 import Session
from botocore.credentials import (
    AssumeRoleCredentialFetcher,
    Credentials,
    DeferredRefreshableCredentials,
)
from botocore.session import Session as BotoSession

def get_refreshable_botocore_session(
    source_credentials: Credentials | None,
    assume_role_arn: str,
    role_session_name: str | None = None,
) -> BotoSession:
    """Get a refreshable botocore session for assuming a role."""
    if source_credentials is not None:
        boto3_session = Session(
            aws_access_key_id=source_credentials.access_key,
            aws_secret_access_key=source_credentials.secret_key,
            aws_session_token=source_credentials.token,
        )
    else:
        boto3_session = Session()

    extra_args = {}
    if role_session_name:
        extra_args["RoleSessionName"] = role_session_name
    fetcher = AssumeRoleCredentialFetcher(
        client_creator=boto3_session.client,
        source_credentials=source_credentials,
        role_arn=assume_role_arn,
        # Forward optional AssumeRole arguments such as RoleSessionName.
        extra_args=extra_args,
    )
    refreshable_credentials = DeferredRefreshableCredentials(
        method="assume-role",
        refresh_using=fetcher.fetch_credentials,
    )
    botocore_session = BotoSession()
    botocore_session._credentials = refreshable_credentials  # noqa: SLF001
    return botocore_session

Which can be used as follows:

credentials = Credentials(
    access_key=client_access_key_id,
    secret_key=client_secret_access_key,
    token=client_session_token,
)
botocore_session = get_refreshable_botocore_session(
    source_credentials=credentials,
    assume_role_arn=self.config["client_iam_role_arn"],
)
catalog_properties["botocore_session"] = botocore_session
load_catalog(**catalog_properties)

This allows the user to work around the one-hour session limit imposed by IAM role chaining, which is very useful for reading extremely large tables.

I'd also like to contribute some of this code upstream at some point to support refreshable botocore sessions in both the AWS Glue/DynamoDB clients, as well as the underlying S3 file system code.

@kevinjqliu
Contributor

Thanks for raising this issue @BTheunissen

Being able to override botocore_session is useful because it supports automatically refreshable credentials for long-running jobs.
...
This allows the user to work around the one-hour session limit imposed by IAM role chaining, which is very useful for reading extremely large tables.

The ability to refresh AWS credentials is important for long-running jobs. Let's open a ticket to track this feature.

See this comment for the reason to deprecate botocore_session.
I wonder if there's another way to implement automatically refreshable credentials without using botocore_session.

@BTheunissen
Author

BTheunissen commented Sep 2, 2024

@kevinjqliu Fair enough: the reason for the deprecation is that catalog settings are generally exposed as a Dict[str, str], and passing in a botocore.Session object breaks this convention.
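
To illustrate the convention mentioned above: string-only properties can be written to a config file or environment variables, while a live session object cannot. A minimal sketch, using a stand-in class so it runs without AWS dependencies (`FakeSession` and `is_serializable` are hypothetical names, not part of botocore or PyIceberg):

```python
import json


class FakeSession:
    """Hypothetical stand-in for a live botocore session object."""


def is_serializable(properties: dict) -> bool:
    """Return True if the properties can be written out as JSON config."""
    try:
        json.dumps(properties)
    except TypeError:
        # json.dumps raises TypeError for objects it cannot encode.
        return False
    return True


# Plain string properties, as the catalog configuration expects.
string_props = {"type": "glue", "client.region": "us-east-1"}

# The same properties with a live object mixed in.
object_props = {**string_props, "botocore_session": FakeSession()}

print(is_serializable(string_props))
print(is_serializable(object_props))
```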

I'd be fine with removing it once the ticket to track credential refresh is written up. I'd take a crack at implementing it, but honestly the workarounds I've had to use to support it, for both the Python boto clients and the underlying filesystem implementations, are pretty hacky. There are some existing issues on the same topic open against the Arrow project, as the guidance from AWS on properly supporting refreshable credentials is very spotty.

@kevinjqliu
Contributor

@BTheunissen +1, opened #1129 to track this feature. It can be hacky for now. This feature is generally nice to have for the project

@cshenrik

@BTheunissen, I'm in the same situation as you, trying to use Pyiceberg with automatically refreshable AWS credentials. Would you be able to share how you made this work with the current version of Pyiceberg? The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.

@kevinjqliu
Contributor

The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.

You can either set the Glue and S3 credentials separately or use the unified AWS credential configs:
https://py.iceberg.apache.org/configuration/#unified-aws-credentials
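
As a rough sketch of the two options, the catalog properties might look like the dictionaries below. The property names are as I understand them from the linked configuration page, so check them against your installed version; the values are placeholders. Note that these are static credentials, so on their own they do not address refresh for long-running jobs.

```python
# Option 1: unified credentials, intended to be shared by the Glue client
# and the S3 FileIO.
unified_properties = {
    "type": "glue",
    "client.region": "us-east-1",
    "client.access-key-id": "<access-key-id>",
    "client.secret-access-key": "<secret-access-key>",
    "client.session-token": "<session-token>",
}

# Option 2: separate credentials for the Glue client and for S3 access.
separate_properties = {
    "type": "glue",
    "glue.region": "us-east-1",
    "glue.access-key-id": "<glue-access-key-id>",
    "glue.secret-access-key": "<glue-secret-access-key>",
    "glue.session-token": "<glue-session-token>",
    "s3.region": "us-east-1",
    "s3.access-key-id": "<s3-access-key-id>",
    "s3.secret-access-key": "<s3-secret-access-key>",
    "s3.session-token": "<s3-session-token>",
}

# Either dictionary would then be passed to the catalog loader, e.g.:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("default", **unified_properties)
```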

@cshenrik

The glue catalog picks up the session correctly, but it doesn't use it for accessing S3.

you can either set glue and s3 credentials separately or use the unified AWS credential configs https://py.iceberg.apache.org/configuration/#unified-aws-credentials

I'm setting botocore_session (which is now deprecated), but S3 doesn't use it. The OP mentions that he had to make some pretty hacky workarounds to make the filesystem implementations pick up the botocore_session. I am hoping those workarounds could be shared here.

@BTheunissen
Author

BTheunissen commented Sep 26, 2024

@cshenrik Sorry for the late reply. I made a small internal fork of the library and added the following logic to io/pyarrow:

def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem:
    if scheme in {"s3", "s3a", "s3n"}:
        from pyarrow.fs import S3FileSystem

        client_kwargs: Dict[str, Any] = {
            "endpoint_override": self.properties.get(S3_ENDPOINT),
            "access_key": get_first_property_value(self.properties, S3_ACCESS_KEY_ID, AWS_ACCESS_KEY_ID),
            "secret_key": get_first_property_value(self.properties, S3_SECRET_ACCESS_KEY, AWS_SECRET_ACCESS_KEY),
            "session_token": get_first_property_value(self.properties, S3_SESSION_TOKEN, AWS_SESSION_TOKEN),
            "region": get_first_property_value(self.properties, S3_REGION, AWS_REGION),
        }

        if proxy_uri := self.properties.get(S3_PROXY_URI):
            client_kwargs["proxy_options"] = proxy_uri

        if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
            client_kwargs["connect_timeout"] = float(connect_timeout)

        if role_arn := self.properties.get(AWS_ROLE_ARN):
            client_kwargs["role_arn"] = role_arn

        if session_name := self.properties.get(AWS_SESSION_NAME):
            client_kwargs["session_name"] = session_name

        return S3FileSystem(**client_kwargs)

Passing the role_arn and session_name will let the S3 File System automatically refresh the credentials of the AWS C++ client used by the PyArrow file system, pretty tedious but working so far!

@cshenrik

Thanks for sharing that, @BTheunissen.

I have to call a bespoke webservice for retrieving AWS credentials, so I can't use that implementation directly, but it's still good to see what others did.
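
For the bespoke web service case, one possible approach for the boto clients is to wrap the service call in botocore's `RefreshableCredentials.create_from_metadata`. This is only a sketch under assumptions: `fetch_from_webservice`, its response field names, and the `"custom-webservice"` method label are hypothetical placeholders, and the private `_credentials` assignment mirrors the original post rather than any supported API. It does not cover the PyArrow S3 filesystem, which does not use the botocore session.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict


def fetch_from_webservice() -> Dict[str, str]:
    """Hypothetical placeholder: call your credential service here."""
    expiry = datetime.now(timezone.utc) + timedelta(hours=1)
    return {
        "AccessKeyId": "example-access-key",
        "SecretAccessKey": "example-secret-key",
        "SessionToken": "example-token",
        "Expiration": expiry.isoformat(),
    }


def to_botocore_metadata(payload: Dict[str, str]) -> Dict[str, str]:
    """Map the (assumed) service response to the keys botocore expects."""
    return {
        "access_key": payload["AccessKeyId"],
        "secret_key": payload["SecretAccessKey"],
        "token": payload["SessionToken"],
        "expiry_time": payload["Expiration"],
    }


def refreshable_session(fetch: Callable[[], Dict[str, str]]):
    """Build a botocore session whose credentials are refreshed via `fetch`."""
    # Imported lazily so the helpers above stay usable without botocore.
    from botocore.credentials import RefreshableCredentials
    from botocore.session import Session as BotoSession

    def refresh() -> Dict[str, str]:
        return to_botocore_metadata(fetch())

    credentials = RefreshableCredentials.create_from_metadata(
        metadata=refresh(),
        refresh_using=refresh,
        method="custom-webservice",
    )
    session = BotoSession()
    # Same private-attribute assignment as in the original post.
    session._credentials = credentials  # noqa: SLF001
    return session
```

Calling `refreshable_session(fetch_from_webservice)` would then give a session that could be passed wherever `botocore_session` is still accepted.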

@kevinjqliu
Contributor

kevinjqliu commented Nov 6, 2024

#1296 added the option to pass role_arn and session_name to pyarrow.fs.S3FileSystem

Passing the role_arn and session_name will let the S3 File System automatically refresh the credentials of the AWS C++ client used by the PyArrow file system, pretty tedious but working so far!

@BTheunissen do you know if passing the role_arn will automatically refresh S3 credentials for long running jobs?

The pyarrow doc just mentions:

AWS Role ARN. If provided instead of access_key and secret_key, temporary credentials will be fetched by assuming this role.
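
As far as I can tell, the same pyarrow documentation also lists a `load_frequency` parameter (default 900 seconds) described as how often temporary credentials from an assumed-role session are refreshed, which suggests the refresh is periodic. I have not verified this on a long-running job. A small sketch of passing these arguments, where `build_s3_kwargs` and `make_s3_filesystem` are hypothetical helper names:

```python
from typing import Any, Dict, Optional


def build_s3_kwargs(
    role_arn: str,
    session_name: Optional[str] = None,
    region: Optional[str] = None,
    load_frequency: int = 900,
) -> Dict[str, Any]:
    """Collect keyword arguments for pyarrow.fs.S3FileSystem."""
    kwargs: Dict[str, Any] = {
        "role_arn": role_arn,
        # Seconds between refreshes of the assumed-role credentials.
        "load_frequency": load_frequency,
    }
    if session_name:
        kwargs["session_name"] = session_name
    if region:
        kwargs["region"] = region
    return kwargs


def make_s3_filesystem(**options: Any):
    """Create the filesystem; pyarrow is imported only when this is called."""
    from pyarrow.fs import S3FileSystem

    return S3FileSystem(**build_s3_kwargs(**options))
```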

@matteosimone

@cshenrik I have the same issue: I'm trying to use an AWS profile that hits a web service to drive automatically refreshable credentials. Did you find any solution to this?

@cshenrik

@cshenrik I have the same issue: I'm trying to use an AWS profile that hits a web service to drive automatically refreshable credentials. Did you find any solution to this?

@matteosimone , I have no solution yet, but the discussion in this PR gives me hope.
