EPIC: spark connect #3581

Open
3 of 52 tasks
universalmind303 opened this issue Dec 16, 2024 · 2 comments

@universalmind303
Contributor

universalmind303 commented Dec 16, 2024

spark connect

distributed execution

for distributed execution we need a ray runner that we can call from rust

  • create rust based shim around our existing python ray runner

We might need this?

  • move DaftContext into rust (this one should be relatively easy)

compatibility/interop

some of the text-based methods (printSchema, show, explain) should produce spark-compatible output.

  • modify to_comfy_table so it can produce a spark-compatible df output.
  • alternative display implementation for Schema that matches spark's
  • create a new TreeDisplay implementation that somewhat matches spark's plans
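
For reference, a minimal sketch of the text layouts that `show` and `printSchema` would need to match. This is plain Python rather than the actual `to_comfy_table` or Schema display code, and the function names `spark_show_string` and `spark_schema_string` are hypothetical. It assumes Spark's default `show` layout (right-aligned cells, minimum column width of 3, cells longer than 20 characters truncated with `...`) and handles flat schemas only. The null representation differs between Spark versions, so it is a parameter here.

```python
from typing import Any, Sequence, Tuple


def spark_show_string(
    columns: Sequence[str],
    rows: Sequence[Sequence[Any]],
    truncate: int = 20,
    null_repr: str = "NULL",  # assumption: older Spark versions print "null"
) -> str:
    """Render rows in a layout resembling Spark's DataFrame.show()."""

    def cell(value: Any) -> str:
        text = null_repr if value is None else str(value)
        if truncate > 0 and len(text) > truncate:
            # Keep the total width at `truncate`, ending in "..." when there is room.
            text = text[: truncate - 3] + "..." if truncate > 3 else text[:truncate]
        return text

    body = [[cell(v) for v in row] for row in rows]
    # Every column is at least 3 characters wide.
    widths = [
        max(3, len(name), *(len(r[i]) for r in body))
        for i, name in enumerate(columns)
    ]
    sep = "+" + "+".join("-" * w for w in widths) + "+"

    def line(cells: Sequence[str]) -> str:
        # Cells are right-aligned within their column.
        return "|" + "|".join(c.rjust(w) for c, w in zip(cells, widths)) + "|"

    return "\n".join([sep, line(list(columns)), sep, *map(line, body), sep])


def spark_schema_string(fields: Sequence[Tuple[str, str, bool]]) -> str:
    """Render (name, type, nullable) triples like printSchema(); flat schemas only."""
    lines = ["root"]
    for name, dtype, nullable in fields:
        lines.append(f" |-- {name}: {dtype} (nullable = {str(nullable).lower()})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(spark_show_string(["id", "name"], [(1, "alice"), (2, None)]))
    print(spark_schema_string([("id", "long", True), ("name", "string", True)]))
```

Nested struct, array, and map fields, as well as the "only showing top N rows" footer, are left out of this sketch and would need to be checked against real Spark output.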

pyspark.sql.DataFrame

pyspark.sql.Catalog

TODO (I don't think this is stabilized in spark connect yet)

pyspark.sql.functions

see https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html for list of functions

UDFs

spark UDFs should be mappable to our UDFs. They use a very similar pickling approach, and we'll likely just need to use their deserializer to deserialize them back into python. Likely a bit more discovery needed.
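
To illustrate the general idea of shipping a Python function as bytes and rebuilding it on the other side, here is a stdlib-only sketch. It is not Spark's wire format and not Daft's UDF code: PySpark serializes functions with cloudpickle, which handles closures, globals, and classes that this toy version does not. The helper names `dump_udf` and `load_udf` are hypothetical.

```python
import builtins
import marshal
import pickle
import types
from typing import Callable


def dump_udf(fn: Callable) -> bytes:
    """Serialize a simple function by value (code object, name, defaults)."""
    payload = {
        # marshal output is specific to the Python version that produced it.
        "code": marshal.dumps(fn.__code__),
        "name": fn.__name__,
        "defaults": fn.__defaults__,
    }
    return pickle.dumps(payload)


def load_udf(blob: bytes) -> Callable:
    """Rebuild a function from dump_udf output; closures are not supported."""
    payload = pickle.loads(blob)
    code = marshal.loads(payload["code"])
    # Fresh globals: the function only sees builtins, not the sender's module.
    return types.FunctionType(
        code, {"__builtins__": builtins}, payload["name"], payload["defaults"]
    )


if __name__ == "__main__":
    add_one = load_udf(dump_udf(lambda x: x + 1))
    print(add_one(41))
```

The practical takeaway is the same as in the paragraph above: the receiving side needs a deserializer that agrees with whatever the sender used, which for PySpark means its own cloudpickle-based one.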

UX/DX

Documentation

  • Quick start guide for spark connect via daft (daft connect)
  • Distributed computing guide for daft connect
    • this should include a guide for how to set up a cluster and connect to it via spark.

Issue Tracking

Upstream Spark issues

universalmind303 added the enhancement (New feature or request) and epic labels Dec 16, 2024
@jaychia
Contributor

jaychia commented Dec 16, 2024

WRT the catalogs, @universalmind303 what do you think of starting to unify around the DaftMetaCatalog that I introduced in #3036?

I think we have a few competing standards atm (including the SQLCatalog). It could be good to start having a catalog abstraction that can be shared across our different frontends

@universalmind303
Contributor Author

universalmind303 commented Dec 16, 2024

> WRT the catalogs, @universalmind303 what do you think of starting to unify around the DaftMetaCatalog that I introduced in #3036?
>
> I think we have a few competing standards atm (including the SQLCatalog). It could be good to start having a catalog abstraction that can be shared across our different frontends

yes, that is something I want to do and have been thinking about. I'll open up an issue to unify daft.catalog and daft.sql.catalog as well
