
[BUGFIX] Support Spark connect dataframes #10420

Merged: 23 commits merged into develop, Sep 19, 2024

Conversation

@tyler-hoffman (Contributor) commented Sep 18, 2024

We previously did isinstance checks on Spark DataFrames and only allowed non-Connect DataFrames. This PR fixes that.

NOTE: This does not work if the Spark session was created via SparkConnectSession's factory methods. This is documented in an xfailed test.
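For context, the core of the change is broadening an isinstance check so that Spark Connect DataFrames pass it too. A minimal sketch of the idea, not the actual Great Expectations code (the helper name is illustrative):

```python
from __future__ import annotations

from pyspark.sql import DataFrame as SparkDataFrame

try:
    # Spark Connect DataFrames are a distinct class (PySpark >= 3.4).
    from pyspark.sql.connect.dataframe import DataFrame as SparkConnectDataFrame
except ImportError:  # Spark Connect extras not installed
    SparkConnectDataFrame = None


def is_spark_dataframe(obj: object) -> bool:
    """Accept classic and Connect DataFrames; previously only classic ones passed."""
    if isinstance(obj, SparkDataFrame):
        return True
    return SparkConnectDataFrame is not None and isinstance(obj, SparkConnectDataFrame)
```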

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, see Contribute.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!

netlify bot commented Sep 18, 2024

Deploy Preview for niobium-lead-7998 canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | f619c5e |
| 🔍 Latest deploy log | https://app.netlify.com/sites/niobium-lead-7998/deploys/66eb75698ae9e60008e1ea7f |

codecov bot commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.96%. Comparing base (dec5dce) to head (f619c5e).
Report is 1 commit behind head on develop.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop   #10420   +/-   ##
========================================
  Coverage    79.95%   79.96%           
========================================
  Files          459      459           
  Lines        39968    39976    +8     
========================================
+ Hits         31957    31965    +8     
  Misses        8011     8011           
| Flag | Coverage Δ |
|------|------------|
| 3.10 | 66.73% <70.00%> (+0.01%) ⬆️ |
| 3.10 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | ? |
| 3.10 aws_deps | ? |
| 3.10 big | ? |
| 3.10 filesystem | ? |
| 3.10 mssql | ? |
| 3.10 mysql | ? |
| 3.10 postgresql | ? |
| 3.10 spark | ? |
| 3.10 trino | ? |
| 3.11 | 66.73% <70.00%> (-0.01%) ⬇️ |
| 3.11 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | ? |
| 3.11 aws_deps | ? |
| 3.11 big | ? |
| 3.11 filesystem | ? |
| 3.11 mssql | ? |
| 3.11 mysql | ? |
| 3.11 postgresql | ? |
| 3.11 spark | ? |
| 3.11 trino | ? |
| 3.12 | 65.33% <70.00%> (+<0.01%) ⬆️ |
| 3.12 aws_deps | 45.88% <70.00%> (+<0.01%) ⬆️ |
| 3.12 big | 54.50% <70.00%> (+<0.01%) ⬆️ |
| 3.12 filesystem | 60.99% <70.00%> (+<0.01%) ⬆️ |
| 3.12 mssql | 49.97% <70.00%> (+<0.01%) ⬆️ |
| 3.12 mysql | 50.03% <70.00%> (+<0.01%) ⬆️ |
| 3.12 postgresql | 54.29% <70.00%> (+<0.01%) ⬆️ |
| 3.12 spark | 57.80% <100.00%> (+<0.01%) ⬆️ |
| 3.12 spark_connect | 46.17% <80.00%> (?) |
| 3.12 trino | 52.42% <70.00%> (+<0.01%) ⬆️ |
| 3.8 | 66.77% <70.00%> (+<0.01%) ⬆️ |
| 3.8 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | 55.09% <70.00%> (+<0.01%) ⬆️ |
| 3.8 aws_deps | 45.91% <70.00%> (+<0.01%) ⬆️ |
| 3.8 big | 54.52% <70.00%> (+<0.01%) ⬆️ |
| 3.8 databricks | 47.60% <70.00%> (+<0.01%) ⬆️ |
| 3.8 filesystem | 61.00% <70.00%> (+<0.01%) ⬆️ |
| 3.8 mssql | 49.96% <70.00%> (+<0.01%) ⬆️ |
| 3.8 mysql | 50.02% <70.00%> (+<0.01%) ⬆️ |
| 3.8 postgresql | 54.27% <70.00%> (+<0.01%) ⬆️ |
| 3.8 snowflake | 48.46% <70.00%> (+<0.01%) ⬆️ |
| 3.8 spark | 57.77% <100.00%> (+<0.01%) ⬆️ |
| 3.8 spark_connect | 46.18% <80.00%> (?) |
| 3.8 trino | 52.41% <70.00%> (+<0.01%) ⬆️ |
| 3.9 | 66.75% <70.00%> (+<0.01%) ⬆️ |
| 3.9 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds | ? |
| 3.9 aws_deps | ? |
| 3.9 big | ? |
| 3.9 filesystem | ? |
| 3.9 mssql | ? |
| 3.9 mysql | ? |
| 3.9 postgresql | ? |
| 3.9 spark | ? |
| 3.9 trino | ? |
| cloud | 0.00% <0.00%> (ø) |
| docs-basic | 52.43% <70.00%> (+<0.01%) ⬆️ |
| docs-creds-needed | 52.71% <70.00%> (+<0.01%) ⬆️ |
| docs-spark | 52.10% <70.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

```python
def test_spark_connect_with_bad_factory_method(
    spark_validation_definition: ValidationDefinition,
):
    """The purpose of this test is to document an issue with using SparkConnectSession's
```
Contributor commented:

If we need to run this test in isolation, since it uses Spark Connect and the other tests don't, we could do the following:

  1. Add a spark_connect marker to this test and to the conftest.py REQUIRED_MARKERS list (see the sketch after this list).
  2. I think you'll need to add a key/value to the tasks.py MARKER_DEPENDENCY_MAP with the key spark_connect whose value is the same as spark.
  3. Update ci.yml to add spark_connect to the marker list.

We'll then need to make it a required marker in the GitHub UI.
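A rough illustration of step 1, using generic pytest marker registration rather than this repo's exact conftest.py conventions (REQUIRED_MARKERS and MARKER_DEPENDENCY_MAP are repo-specific and not shown):

```python
# conftest.py
def pytest_configure(config):
    # Register the custom marker so pytest does not warn that it is unknown.
    config.addinivalue_line(
        "markers", "spark_connect: tests that require a Spark Connect session"
    )


# in the test module
import pytest


@pytest.mark.spark_connect
def test_spark_connect_with_bad_factory_method():
    ...
```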

Contributor Author (tyler-hoffman) commented:

Do you think it's worth it? Happy to do those updates if you want, and I had thought about it earlier, but I decided against it because:

  • If anything, the spark_connect marker would probably apply to all tests in this file, so we'd probably want another marker just for this case, spark_connect_without_a_session or something. We could do it, but it seems potentially confusing to me to have a marker that would really just apply to the one file.
  • I mostly just added the test because it was clearer (IMO) than adding a comment in the description.
  • In this case, I don't see much value in being alerted in the unlikely case that this started passing.

So I personally think it's fine to keep it as is, but if you lean toward making sure it runs and fails, I'm happy to add another marker.

Contributor Author (tyler-hoffman) commented:

Well, I ended up needing a separate flag for spark_connect just to support running these tests (otherwise CI errored, saying we already had a regular Spark session, so we couldn't get a Spark Connect one). So I added that. There were a couple of other changes that needed to be made as well. But @billdirks, I think I'll have to bug you about making it required in the GH UI.
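(For background: a Spark Connect session is obtained against a remote endpoint, and PySpark refuses to create one while a regular in-process SparkSession is already active, which appears to be the conflict CI hit. Roughly, with a placeholder endpoint URL:)

```python
from pyspark.sql import SparkSession

# Placeholder endpoint; getOrCreate() errors out if a classic SparkSession
# already exists in the process, matching the CI failure described above.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
```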

Contributor commented:

Does this test always fail now? Wondering if we can update the strict value and the docstring?
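For reference, strict here is pytest's xfail knob: with strict=True, an unexpected pass (XPASS) fails the suite, so a fixed test can't stay silently xfailed. A minimal sketch (the reason string is illustrative):

```python
import pytest


@pytest.mark.xfail(strict=True, reason="SparkConnectSession factory methods not supported")
def test_spark_connect_with_bad_factory_method():
    ...
```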

Contributor Author (tyler-hoffman) commented:

Turns out it now passes with this mark. I was very confident it was not passing when I was running both tests with the other mark; that was the fundamental problem I was hitting when testing this last week too. It doesn't make sense to me yet, and I'm not sure on the specifics of what is different, but I'm no longer xfailing the test. I'll mess around with this a bit tomorrow.

@tyler-hoffman force-pushed the b/v1-506/spark-connect-dataframes-2 branch from bc10baa to 4167f4a on September 18, 2024 20:10
@billdirks (Contributor) left a comment:

Thanks for getting to the bottom of this!

@tyler-hoffman force-pushed the b/v1-506/spark-connect-dataframes-2 branch from ed77586 to 4dc0537 on September 19, 2024 00:50
@tyler-hoffman added this pull request to the merge queue Sep 19, 2024
Merged via the queue into develop with commit 5b2a969 on Sep 19, 2024; 69 checks passed
@tyler-hoffman deleted the b/v1-506/spark-connect-dataframes-2 branch September 19, 2024 12:45