[BUGFIX] Support Spark connect dataframes #10420
Conversation
✅ Deploy Preview for niobium-lead-7998 canceled.
Codecov Report: All modified and coverable lines are covered by tests ✅
All tests successful. No failed tests found.

@@           Coverage Diff            @@
##           develop   #10420   +/-  ##
========================================
  Coverage    79.95%   79.96%
========================================
  Files          459      459
  Lines        39968    39976       +8
========================================
+ Hits         31957    31965       +8
  Misses        8011     8011

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
def test_spark_connect_with_bad_factory_method(
    spark_validation_definition: ValidationDefinition,
):
    """The purpose of this test is to document an issue with using SparkConnectSession's
If we need to run this test in isolation, since it uses Spark Connect and the other tests don't, we could do the following (see the sketch after this list):

- Add a `spark_connect` marker to this test and to the `conftest.py` REQUIRED_MARKERS list.
- I think you'll need to add a key/value pair to the `tasks.py` MARKER_DEPENDENCY_MAP with the key `spark_connect`, whose value is the same as `spark`.
- Update `ci.yml` to add `spark_connect` to the marker list.

We'll then need to make it a required marker in the GitHub UI.
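For illustration only, a minimal sketch of the changes being proposed; the real REQUIRED_MARKERS and MARKER_DEPENDENCY_MAP in this repo may be shaped differently, and the entries below are assumed:

```python
# conftest.py (sketch): register the new marker so pytest recognizes it
REQUIRED_MARKERS = [
    "spark",
    "spark_connect",  # hypothetical new marker for Spark Connect tests
]

# tasks.py (sketch): reuse the spark dependency set for the new marker
MARKER_DEPENDENCY_MAP = {
    "spark": "spark",          # existing entry (value assumed)
    "spark_connect": "spark",  # new marker depends on the same extras as spark
}
```

The test itself would then opt in with `@pytest.mark.spark_connect`, and CI could select or deselect it via `-m spark_connect`.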
Do you think it's worth it? Happy to do those updates if you want, and I had thought about it earlier, but I decided against it because:

- If anything, the `spark_connect` marker would probably apply to all tests in this file, so we'd probably want another marker just for this one, something like `spark_connect_without_a_session`. We could do it, but it seems potentially confusing to me to have a marker that would really apply to just the one test.
- I mostly added the test because it was clearer (IMO) than adding a comment in the description.
- In this case, I don't see much value in being alerted in the unlikely case that this started passing.

So I personally think it's fine to keep it as is, but if you lean toward making sure it runs and fails, I'm happy to add another marker!
Well, I ended up needing a separate flag for `spark_connect` just to support running these tests (otherwise CI errored saying we already had a regular Spark session, so it couldn't get a Spark Connect one). So I added that, along with a couple of other changes that were needed. But @billdirks I think I'll have to bug you about making it required in the GitHub UI.
Does this test always fail now? Wondering if we can update the `strict` value and the docstring?
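(For context, the `strict` value here refers to pytest's xfail marker; a minimal sketch of the pattern under discussion, with an assumed reason string:)

```python
import pytest

# Sketch only: with strict=True, pytest fails the run if an xfailed test
# unexpectedly passes; with strict=False it just reports XPASS.
@pytest.mark.xfail(
    strict=True,
    reason="SparkConnectSession factory methods are not supported",  # assumed wording
)
def test_spark_connect_with_bad_factory_method(spark_validation_definition):
    ...
```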
Turns out it now passes with this mark. I was very confident it was not passing when I was running both tests with the other mark; that was the fundamental problem I was hitting when testing this last week too. That doesn't make sense, and I'm not yet sure on the specifics of what is different, but I'm no longer xfailing the test. I'll mess around with this a bit tomorrow.
(Force-pushed from bc10baa to 4167f4a.)
Thanks for getting to the bottom of this!
(Force-pushed from ed77586 to 4dc0537.)
We previously did isinstance checks on spark dataframes and only allowed non-connect dataframes. This fixes that.
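A rough sketch of the kind of check involved (not the PR's actual diff; the Connect DataFrame class path below exists in PySpark >= 3.4, and the helper name is illustrative):

```python
from pyspark.sql import DataFrame as SparkDataFrame

try:
    # Spark Connect dataframes are a distinct class (PySpark >= 3.4);
    # the import can fail on older versions or without the connect extras.
    from pyspark.sql.connect.dataframe import DataFrame as SparkConnectDataFrame

    _SPARK_DF_TYPES = (SparkDataFrame, SparkConnectDataFrame)
except ImportError:
    _SPARK_DF_TYPES = (SparkDataFrame,)


def is_spark_dataframe(obj) -> bool:
    """Accept both classic and Spark Connect dataframes (illustrative helper)."""
    return isinstance(obj, _SPARK_DF_TYPES)
```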
NOTE: This does not work if the spark session was created via `SparkConnectSession`'s factory methods. This is documented in an xfailed test.

`invoke lint` (uses `ruff format` + `ruff check`)

For more information about contributing, see Contribute.
After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!