Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-threading issues on information_schema queries #43

Closed
jessedobbelaere opened this issue Nov 20, 2022 · 3 comments · Fixed by #45
Closed

Multi-threading issues on information_schema queries #43

jessedobbelaere opened this issue Nov 20, 2022 · 3 comments · Fixed by #45

Comments

@jessedobbelaere
Copy link
Contributor

jessedobbelaere commented Nov 20, 2022

I'm switching a few projects to this dbt-athena community adapter and I notice a regression bug 🐛

Locally, I use threads: 4. I have a few databases defined in my dbt_project.yml:

models:
  test_jesse:
    silver:
      test_a:
        +schema: silver_test_a
      test_b:
        +schema: silver_test_b
      test_c:
        +schema: silver_test_c

When I run dbt --debug run I see it starts by making 3 parallel queries to INFORMATION_SCHEMA (corresponding to the 3 custom schemas)

Output of the failing `dbt --debug run`:
23:22:00.639968 [info ] [MainThread]: Found 79 models, 9 tests, 0 snapshots, 0 analyses, 498 macros, 0 operations, 0 seed files, 64 sources, 0 exposures, 0 metrics
23:22:00.642584 [info ] [MainThread]:
23:22:00.642973 [debug] [MainThread]: Acquiring new athena connection "master"
23:22:00.645492 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.652730 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.654241 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:22:00.655511 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.655686 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.655861 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:22:00.656061 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.656238 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.656446 [debug] [ThreadPool]: On list_awsdatacatalog: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.656774 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.660507 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.673268 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:22:00.675392 [error] [ThreadPool]: Athena adapter: Got an error when attempting to open a Athena connection due to 'credential_provider'
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/dbt/adapters/athena/connections.py", line 155, in open
    handle = AthenaConnection(
  File "/opt/homebrew/lib/python3.9/site-packages/pyathena/connection.py", line 139, in __init__
    self._client = self._session.client(
  File "/opt/homebrew/lib/python3.9/site-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
  File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 951, in create_client
    credentials = self.get_credentials()
  File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 507, in get_credentials
    self._credentials = self._components.get_component(
  File "/opt/homebrew/lib/python3.9/site-packages/botocore/session.py", line 1112, in get_component
    del self._deferred[name]
KeyError: 'credential_provider'
23:22:00.688091 [debug] [ThreadPool]: Athena adapter: Error running SQL: /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:22:00.688642 [debug] [ThreadPool]: Athena adapter: Error running SQL: macro list_schemas
23:22:00.688916 [debug] [ThreadPool]: On list_awsdatacatalog: No close available on handle
23:22:07.156437 [debug] [ThreadPool]: SQL status: OK -1 in 6.5 seconds
23:22:07.158140 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:22:07.159025 [debug] [ThreadPool]: SQL status: OK -1 in 6.5 seconds
23:22:07.159950 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:22:07.160520 [debug] [MainThread]: Connection 'master' was properly closed.
23:22:07.160681 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.160892 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.161071 [debug] [MainThread]: Connection 'list_awsdatacatalog' was properly closed.
23:22:07.161847 [debug] [MainThread]: Flushing usage events
23:22:07.162046 [error] [MainThread]: Encountered an error:
Runtime Error
  Runtime Error
    Database Error
      'credential_provider'

When I switch back to Tomme's adapter (dbt-athena-adapter==1.0.1), It makes three queries in parallel successfully.

Output of the failing `dbt --debug run`:
23:23:51.801549 [info ] [MainThread]: Found 79 models, 9 tests, 0 snapshots, 0 analyses, 490 macros, 0 operations, 0 seed files, 64 sources, 0 exposures, 0 metrics
23:23:51.804227 [info ] [MainThread]:
23:23:51.804618 [debug] [MainThread]: Acquiring new athena connection "master"
23:23:51.807446 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.814927 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.816278 [debug] [ThreadPool]: Acquiring new athena connection "list_awsdatacatalog"
23:23:51.817486 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817641 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817786 [debug] [ThreadPool]: Using athena connection "list_awsdatacatalog"
23:23:51.817940 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:23:51.818153 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:23:51.818459 [debug] [ThreadPool]: On list_awsdatacatalog: -- /* {"app": "dbt", "dbt_version": "1.3.1", "profile_name": "athena", "target_name": "dev", "connection_name": "list_awsdatacatalog"} */

    select
        distinct schema_name

    from awsdatacatalog.INFORMATION_SCHEMA.schemata

23:23:51.818715 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:51.828140 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:51.832063 [debug] [ThreadPool]: Opening a new connection, currently in state init
23:23:58.460654 [debug] [ThreadPool]: SQL status: OK -1 in 6.64 seconds
23:23:58.461749 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:23:58.465697 [debug] [ThreadPool]: SQL status: OK -1 in 6.65 seconds
23:23:58.466534 [debug] [ThreadPool]: On list_awsdatacatalog: Close
23:23:58.473755 [debug] [ThreadPool]: SQL status: OK -1 in 6.66 seconds
23:23:58.474907 [debug] [ThreadPool]: On list_awsdatacatalog: Close

When I use threads: 1 or threads: 2 it works and successfully deploys the dbt project, but it starts failing with 3 or more threads.

This might be a regression? I found a similar issue Tomme/dbt-athena#41

@jessedobbelaere
Copy link
Contributor Author

jessedobbelaere commented Nov 21, 2022

I could reproduce starting from this adapter's first release (1.0.4): https://github.com/dbt-athena/dbt-athena/pull/7/files
The PR that looks suspicious seems to be #16

I made a local change and reverted one line in connections.py and it deploys my project again with 8 threads 👌

            handle = AthenaConnection(
                s3_staging_dir=creds.s3_staging_dir,
                endpoint_url=creds.endpoint_url,
                schema_name=creds.schema,
                work_group=creds.work_group,
                cursor_class=AthenaCursor,
                formatter=AthenaParameterFormatter(),
                poll_interval=creds.poll_interval,
-               session=get_boto3_session(connection),
+               profile_name=creds.aws_profile_name,
                retry_config=RetryConfig(
                    attempt=creds.num_retries,
                    exceptions=(
                        "ThrottlingException",
                        "TooManyRequestsException",
                        "InternalServerException",
                    ),
                ),
            )

💡 If this only happens with the information_schema queries, we might fix this by calling the Glue API (faster) instead of querying information_schema in Athena? See suggestion: #4 (comment)

@nicor88
Copy link
Contributor

nicor88 commented Nov 21, 2022

Could be that the regression was introduced by Share same boto session between every calls, I honestly think that switching to Glue API is better.

@jessedobbelaere
Copy link
Contributor Author

jessedobbelaere commented Nov 21, 2022

I have a branch with the Glue API implemented instead of information_schema but the multi-threading error in botocore are still happening when I use more than 1-2 threads.

Found this to be very interesting:

The tl;dr is

... for boto3 you can see that they recommend one client per session as there is shared data between instance that can be mutated by individual threads.

So you need one session per thread. We use 1 global session right now, which is causing the multi-threading errors: https://github.com/dbt-athena/dbt-athena/blob/main/dbt/adapters/athena/session.py#L9-L15C10

Which was introduced in Tomme/dbt-athena#125
I can create a fix where we change session.py to the code below. This runs perfectly with 8 threads:

import boto3.session
from dbt.contracts.connection import Connection


def get_boto3_session(connection: Connection = None) -> boto3.session.Session:
    if connection is None:
        raise RuntimeError(
            "A Connection object needs to be passed to initialize the boto3 session"
        )

    return boto3.session.Session(
        region_name=connection.credentials.region_name,
        profile_name=connection.credentials.aws_profile_name,
    )

It still respects the profile_name but the session is no longer global. Each thread creates one AthenaConnection with one get_boto3_session 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants