Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix databricks pandas error #1443

Merged
merged 10 commits into from
Jun 11, 2024
Merged

Fix databricks pandas error #1443

merged 10 commits into from
Jun 11, 2024

Conversation

steinitzu
Copy link
Collaborator

Description

Continued from #1436 without fork

Related Issues

  • Fixes #...
  • Closes #...
  • Resolves #...

Additional Context

Copy link

netlify bot commented Jun 5, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit fda4608
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6668397765a9f100086cc878

@steinitzu steinitzu changed the base branch from devel to master June 5, 2024 12:51
@steinitzu steinitzu force-pushed the sthor/fix-databricks-pandas-error branch from e3c5d68 to b22f4aa Compare June 5, 2024 12:52
@steinitzu steinitzu changed the base branch from master to devel June 5, 2024 13:41
@steinitzu steinitzu force-pushed the sthor/fix-databricks-pandas-error branch 4 times, most recently from 4cfbb87 to 06de0b3 Compare June 5, 2024 13:47
@steinitzu steinitzu force-pushed the sthor/fix-databricks-pandas-error branch from 06de0b3 to 41a8a6f Compare June 5, 2024 13:51
@steinitzu steinitzu marked this pull request as ready for review June 5, 2024 13:51
@rudolfix rudolfix added the bug Something isn't working label Jun 5, 2024
@@ -264,7 +264,7 @@ def __init__(self, schema: Schema, config: DatabricksClientConfiguration) -> Non
sql_client = DatabricksSqlClient(config.normalize_dataset_name(schema), config.credentials)
super().__init__(schema, config, sql_client)
self.config: DatabricksClientConfiguration = config
self.sql_client: DatabricksSqlClient = sql_client
self.sql_client: DatabricksSqlClient = sql_client # type: ignore[assignment]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did it work before?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea tbh. Maybe something to do with 2.9 driver not having type hints? 3.x does.

@@ -37,7 +38,9 @@ def __init__(self, dataset_name: str, credentials: DatabricksCredentials) -> Non

def open_connection(self) -> DatabricksSqlConnection:
conn_params = self.credentials.to_connector_params()
self._conn = databricks_lib.connect(**conn_params, schema=self.dataset_name)
self._conn = databricks_lib.connect(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is my only issue here: could we support 2.9 driver? the 3.x has

paramstyle = "named"

and 2.9 pyformat so it is trivial to distinguish them
why: I bet there are people with constrained environments (ie Airflow) where upgrading dependencies is hard. and 3.0 was released only in march

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.9 still works if you have it installed. There's no change in our code except this argument that's ignored in 2.9.
paramstyle doesn't matter with use_inline_params, this makes 3.x work like 2.x which does not use parametrized queries at all, it's some custom escaping and %s string formatting under the hood. 🤷‍♂️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to switch to named paramstyle we can detect driver version and cluster version. But I don't know enough about it to tell if this is reliable in all environments: https://docs.databricks.com/en/sql/language-manual/functions/current_version.html

pyproject.toml Outdated
@@ -81,8 +81,8 @@ weaviate-client = {version = ">=3.22", optional = true}
adlfs = {version = ">=2022.4.0", optional = true}
pyodbc = {version = "^4.0.39", optional = true}
qdrant-client = {version = "^1.6.4", optional = true, extras = ["fastembed"]}
databricks-sql-connector = {version = ">=2.9.3,<3.0.0", optional = true}
dbt-databricks = {version = ">=1.7.3", optional = true}
databricks-sql-connector = {version = "^3", optional = true}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK this is good

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I cleaned up dbt stuff (moved from extras to dependency group), allowed databricks > 2.9.3 bumped dependency to >3.

all seems to work together

@rudolfix rudolfix merged commit d4340d8 into devel Jun 11, 2024
49 of 50 checks passed
@rudolfix rudolfix deleted the sthor/fix-databricks-pandas-error branch June 11, 2024 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants