
[BUGFIX] Support Spark connect dataframes #10420

Merged: 23 commits merged into develop, Sep 19, 2024

Conversation

@tyler-hoffman (Contributor) commented Sep 18, 2024

We previously did isinstance checks on Spark DataFrames and only allowed non-Connect DataFrames. This PR fixes that.

NOTE: This does not work if the Spark session was created via SparkConnectSession's factory methods. This is documented in an xfailed test.
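For context, the core of the change is broadening an isinstance check so that Spark Connect DataFrames pass it too. A minimal sketch of the idea, not the actual Great Expectations code (the helper name is illustrative):

```python
from __future__ import annotations

from pyspark.sql import DataFrame as SparkDataFrame

try:
    # Spark Connect DataFrames are a distinct class (PySpark >= 3.4).
    from pyspark.sql.connect.dataframe import DataFrame as SparkConnectDataFrame
except ImportError:  # Spark Connect extras not installed
    SparkConnectDataFrame = None


def is_spark_dataframe(obj: object) -> bool:
    """Accept classic and Connect DataFrames; previously only classic ones passed."""
    if isinstance(obj, SparkDataFrame):
        return True
    return SparkConnectDataFrame is not None and isinstance(obj, SparkConnectDataFrame)
```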

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, see Contribute.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!

netlify bot commented Sep 18, 2024

Deploy Preview for niobium-lead-7998 canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | f619c5e |
| 🔍 Latest deploy log | https://app.netlify.com/sites/niobium-lead-7998/deploys/66eb75698ae9e60008e1ea7f |

codecov bot commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.96%. Comparing base (dec5dce) to head (f619c5e).
Report is 1 commit behind head on develop.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop   #10420   +/-   ##
========================================
  Coverage    79.95%   79.96%           
========================================
  Files          459      459           
  Lines        39968    39976    +8     
========================================
+ Hits         31957    31965    +8     
  Misses        8011     8011           
| Flag | Coverage Δ |
|------|------------|
| 3.10 | 66.73% <70.00%> (+0.01%) ⬆️ |
| 3.10 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | ? |
| 3.10 aws_deps | ? |
| 3.10 big | ? |
| 3.10 filesystem | ? |
| 3.10 mssql | ? |
| 3.10 mysql | ? |
| 3.10 postgresql | ? |
| 3.10 spark | ? |
| 3.10 trino | ? |
| 3.11 | 66.73% <70.00%> (-0.01%) ⬇️ |
| 3.11 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | ? |
| 3.11 aws_deps | ? |
| 3.11 big | ? |
| 3.11 filesystem | ? |
| 3.11 mssql | ? |
| 3.11 mysql | ? |
| 3.11 postgresql | ? |
| 3.11 spark | ? |
| 3.11 trino | ? |
| 3.12 | 65.33% <70.00%> (+<0.01%) ⬆️ |
| 3.12 aws_deps | 45.88% <70.00%> (+<0.01%) ⬆️ |
| 3.12 big | 54.50% <70.00%> (+<0.01%) ⬆️ |
| 3.12 filesystem | 60.99% <70.00%> (+<0.01%) ⬆️ |
| 3.12 mssql | 49.97% <70.00%> (+<0.01%) ⬆️ |
| 3.12 mysql | 50.03% <70.00%> (+<0.01%) ⬆️ |
| 3.12 postgresql | 54.29% <70.00%> (+<0.01%) ⬆️ |
| 3.12 spark | 57.80% <100.00%> (+<0.01%) ⬆️ |
| 3.12 spark_connect | 46.17% <80.00%> (?) |
| 3.12 trino | 52.42% <70.00%> (+<0.01%) ⬆️ |
| 3.8 | 66.77% <70.00%> (+<0.01%) ⬆️ |
| 3.8 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | 55.09% <70.00%> (+<0.01%) ⬆️ |
| 3.8 aws_deps | 45.91% <70.00%> (+<0.01%) ⬆️ |
| 3.8 big | 54.52% <70.00%> (+<0.01%) ⬆️ |
| 3.8 databricks | 47.60% <70.00%> (+<0.01%) ⬆️ |
| 3.8 filesystem | 61.00% <70.00%> (+<0.01%) ⬆️ |
| 3.8 mssql | 49.96% <70.00%> (+<0.01%) ⬆️ |
| 3.8 mysql | 50.02% <70.00%> (+<0.01%) ⬆️ |
| 3.8 postgresql | 54.27% <70.00%> (+<0.01%) ⬆️ |
| 3.8 snowflake | 48.46% <70.00%> (+<0.01%) ⬆️ |
| 3.8 spark | 57.77% <100.00%> (+<0.01%) ⬆️ |
| 3.8 spark_connect | 46.18% <80.00%> (?) |
| 3.8 trino | 52.41% <70.00%> (+<0.01%) ⬆️ |
| 3.9 | 66.75% <70.00%> (+<0.01%) ⬆️ |
| 3.9 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | ? |
| 3.9 aws_deps | ? |
| 3.9 big | ? |
| 3.9 filesystem | ? |
| 3.9 mssql | ? |
| 3.9 mysql | ? |
| 3.9 postgresql | ? |
| 3.9 spark | ? |
| 3.9 trino | ? |
| cloud | 0.00% <0.00%> (ø) |
| docs-basic | 52.43% <70.00%> (+<0.01%) ⬆️ |
| docs-creds-needed | 52.71% <70.00%> (+<0.01%) ⬆️ |
| docs-spark | 52.10% <70.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

```python
def test_spark_connect_with_bad_factory_method(
    spark_validation_definition: ValidationDefinition,
):
    """The purpose of this test is to document an issue with using SparkConnectSession's
```
Contributor commented:

If we need to run this test in isolation, since it uses Spark Connect and the other tests don't, we could do the following:

  1. Add a spark_connect marker to this test and to the conftest.py REQUIRED_MARKERS list (see the sketch after this list).
  2. I think you'll need to add a key/value to the tasks.py MARKER_DEPENDENCY_MAP with the key spark_connect whose value is the same as spark.
  3. Update ci.yml to add spark_connect to the marker list.

We'll then need to make it a required marker in the GitHub UI.
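A rough illustration of step 1, using generic pytest marker registration rather than this repo's exact conftest.py conventions (REQUIRED_MARKERS and MARKER_DEPENDENCY_MAP are repo-specific and not shown):

```python
# conftest.py
def pytest_configure(config):
    # Register the custom marker so pytest does not warn that it is unknown.
    config.addinivalue_line(
        "markers", "spark_connect: tests that require a Spark Connect session"
    )


# in the test module
import pytest


@pytest.mark.spark_connect
def test_spark_connect_with_bad_factory_method():
    ...
```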

Contributor Author (tyler-hoffman) commented:

Do you think it's worth it? Happy to do those updates if you want, and I had thought about it earlier, but I decided against it because:

  • If anything, the spark_connect marker would probably apply to all tests in this file, so we'd probably want another marker just for this case, spark_connect_without_a_session or something. We could do it, but it seems potentially confusing to me to have a marker that would really just apply to the one file.
  • I mostly just added the test because it was clearer (IMO) than adding a comment in the description.
  • In this case, I don't see much value in being alerted in the unlikely case that this started passing.

So I personally think it's fine to keep it as is, but if you lean toward making sure it runs and fails, I'm happy to add another marker.

Contributor Author (tyler-hoffman) commented:

Well, I ended up needing a separate flag for spark_connect just to support running these tests (otherwise CI errored, saying we already had a regular Spark session, so we couldn't get a Spark Connect one). So I added that. There were a couple of other changes that needed to be made as well. But @billdirks, I think I'll have to bug you about making it required in the GH UI.
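(For background: a Spark Connect session is obtained against a remote endpoint, and PySpark refuses to create one while a regular in-process SparkSession is already active, which appears to be the conflict CI hit. Roughly, with a placeholder endpoint URL:)

```python
from pyspark.sql import SparkSession

# Placeholder endpoint; getOrCreate() errors out if a classic SparkSession
# already exists in the process, matching the CI failure described above.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
```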

Contributor commented:

Does this test always fail now? Wondering if we can update the strict value and the docstring?
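For reference, strict here is pytest's xfail knob: with strict=True, an unexpected pass (XPASS) fails the suite, so a fixed test can't stay silently xfailed. A minimal sketch (the reason string is illustrative):

```python
import pytest


@pytest.mark.xfail(strict=True, reason="SparkConnectSession factory methods not supported")
def test_spark_connect_with_bad_factory_method():
    ...
```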

Contributor Author (tyler-hoffman) commented:

Turns out it now passes with this mark. I was very confident it was not passing when I was running both tests with the other mark; that was the fundamental problem I was hitting when testing this last week too. It doesn't make sense to me yet, and I'm not sure on the specifics of what is different, but I'm no longer xfailing the test. I'll mess around with this a bit tomorrow.

@tyler-hoffman force-pushed the b/v1-506/spark-connect-dataframes-2 branch from bc10baa to 4167f4a on September 18, 2024 20:10
@billdirks (Contributor) left a comment:

Thanks for getting to the bottom of this!

@tyler-hoffman force-pushed the b/v1-506/spark-connect-dataframes-2 branch from ed77586 to 4dc0537 on September 19, 2024 00:50
@tyler-hoffman added this pull request to the merge queue Sep 19, 2024
Merged via the queue into develop with commit 5b2a969 on Sep 19, 2024; 69 checks passed
@tyler-hoffman deleted the b/v1-506/spark-connect-dataframes-2 branch September 19, 2024 12:45