Allow Spark SQL as a dialect #968

Merged: 30 commits, Dec 24, 2023

Conversation

gilandose

@gilandose gilandose commented Dec 18, 2023

Describe your changes

Makes it possible to use the library's basic functionality with Spark. This allows %%sql usage and returns a Spark DataFrame for further processing, and it integrates with most of the library's existing functionality. It also introduces lazy_execution, which lets frameworks that support it bypass this library's ResultSet and stay in native formats. This is useful when moving back and forth between SQL and Python, and also for validating queries without full execution when composing CTEs.
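The lazy_execution behaviour described above can be illustrated with a minimal, hypothetical sketch (all names below are invented for illustration; this is not JupySQL's actual implementation): when the flag is on, the native Spark DataFrame is handed back untouched instead of being materialized into a ResultSet.

```python
# Illustrative-only sketch of the lazy_execution idea. FakeSparkDataFrame
# stands in for a real pyspark DataFrame; materialization into the library's
# ResultSet is simulated by collecting rows into a plain list.
class FakeSparkDataFrame:
    """Stand-in for a native Spark DataFrame (lazy by nature)."""

    def __init__(self, rows):
        self._rows = rows

    def collect(self):
        # In real Spark, this is what triggers execution of the query plan.
        return list(self._rows)


def run_statement(native_df, lazy_execution=False):
    """Return the native object when lazy, else materialize the rows."""
    if lazy_execution:
        return native_df  # stay in Spark's native format for further chaining
    return native_df.collect()  # eager: behaves like a materialized ResultSet


df = FakeSparkDataFrame([(1, "a"), (2, "b")])
print(type(run_statement(df, lazy_execution=True)).__name__)  # FakeSparkDataFrame
print(run_statement(df, lazy_execution=False))  # [(1, 'a'), (2, 'b')]
```

This also shows why lazy mode helps with query validation: the plan can be composed and inspected without ever calling collect().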

Issue number

Related #536
Closes #965

Checklist before requesting a review


📚 Documentation preview 📚: https://jupysql--968.org.readthedocs.build/en/968/

@neelasha23

I ran the notebook and faced some missing-dependency issues even after installing pyspark==3.4.1.

(Three screenshots of the error tracebacks attached.)

I had to install the following packages: grpcio, grpcio-status, google, protobuf @gilandose
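For reference, the workaround described above amounts to an install command along these lines (the package list is taken from the comment; the pyspark pin is the version mentioned earlier in the thread):

```shell
# Packages reported as missing when running the Spark notebook; installing
# them alongside pyspark resolved the import errors described above.
pip install "pyspark==3.4.1" grpcio grpcio-status protobuf google
```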

Review thread: src/tests/test_magic.py (outdated, resolved)
@gilandose
Author

@neelasha23 I've added grpcio-status; it provides all the other missing dependencies.
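A hypothetical sketch of what the resulting dependency declaration might look like in setup.py (the extra's name and the version pin are assumptions; the point from the comment is that grpcio-status transitively pulls in the other missing packages):

```python
# Hypothetical setup.py extras sketch: the "spark" extra name and the pyspark
# pin are assumptions. grpcio-status depends on grpcio and protobuf, so
# listing it alone covers the previously missing packages.
extras_require = {
    "spark": [
        "pyspark>=3.4",
        "grpcio-status",  # pulls in grpcio and protobuf transitively
    ],
}

print(sorted(extras_require["spark"]))  # ['grpcio-status', 'pyspark>=3.4']
```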

@neelasha23

Please add a Changelog entry and ensure that the CI is passing @gilandose

@gilandose gilandose changed the title Allow Spark Allow Spark SQL as a dialect Dec 19, 2023
@gilandose
Author

Integration tests added; should be good to go.

@gilandose
Author

Yes, ready for review @neelasha23. Fixed linting, and hopefully resolved the CI environment-variable lookup issue during the integration tests.


@edublancas edublancas left a comment


great work, mostly minor stuff.

give another stab to the integration tests; if you have difficulty getting them to work, let me know so someone on the team can help you!

Review threads:
- doc/api/configuration.md (resolved)
- setup.py (outdated, resolved)
- src/sql/error_handler.py (outdated, resolved)
- src/sql/magic.py (resolved)
- src/sql/plot.py (resolved)
- src/sql/run/resultset.py (outdated, resolved)
- src/sql/run/run.py (outdated, resolved)
- src/sql/run/sparkdataframe.py (outdated, resolved; two threads)
@gilandose
Author

gilandose commented Dec 21, 2023

> great work, mostly minor stuff.
>
> give another stab to the integration tests; if you have difficulty getting them to work, let me know so someone on the team can help you!

Struggling to get the PostgreSQL tests to run locally, which seem to be the ones failing; I'll have another attempt and see. @edublancas

@gilandose
Author

PostgreSQL tests passing locally; it was the print statements I'd left in error_handler! Let me know your thoughts on lazy_execution.


@edublancas edublancas left a comment


please check our contribution guidelines: https://ploomber-contributing.readthedocs.io/en/latest/contributing/responding-pr-review.html

pasting a link to the commit with the changes simplifies reviewing.

Review threads:
- doc/integrations/spark.ipynb (resolved)
- src/sql/error_handler.py (outdated, resolved)
- src/sql/run/resultset.py (outdated, resolved)
- src/sql/plot.py (resolved)
@edublancas

@gilandose please check the failed CI tests

@gilandose
Author

gilandose commented Dec 23, 2023

> @gilandose please check the failed CI tests

Didn't realise there was a separate list in noxfile.py; added. Everything should be addressed now.
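For context, the separate list mentioned here refers to noxfile.py keeping its own enumeration of integration-test targets; a hypothetical sketch of the kind of entry a new dialect needs (all names below are assumptions, not JupySQL's actual noxfile):

```python
# Hypothetical sketch: a per-database target list like the one in noxfile.py
# that the new Spark integration had to be added to (names are assumptions).
INTEGRATION_TEST_TARGETS = [
    "postgresql",
    "mysql",
    "duckdb",
    "spark",  # newly added alongside this PR
]

print("spark" in INTEGRATION_TEST_TARGETS)  # True
```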

@gilandose
Author

@edublancas any idea why the docker containers for integration tests don't start automatically for me when running the integration tests locally?


@edublancas edublancas left a comment


Just fix the connection.py file so the notebook passes, and fix the CI.

@edublancas

> @edublancas any idea why the docker containers for integration tests don't start automatically for me when running the integration tests locally?

No idea. Are you seeing any errors? We've had other contributors encounter issues when running integration tests locally, so we definitely need to improve the setup. I changed the repository settings so the CI runs whenever you push new code (previously, I had to approve every run); this should let you test changes in the CI quickly. The integration tests are passing; only the unit tests are failing. We're getting closer!

@edublancas edublancas merged commit f4088a3 into ploomber:master Dec 24, 2023
23 checks passed
@edublancas

🎉 thanks a lot for working on this, this is great!

@edublancas

I'll make a release now

Linked issue (may be closed by merging this pull request): Interested in Spark SQL support via spark connect?