feat(spark): Support local filesystem without sparksession #554
Comments
There is a serious problem here that is difficult to avoid: GraphAr-commons relies on Spark's DataFrame, and constructing a DataFrame requires a SparkSession, so we cannot build a DataFrame directly without a SparkSession.
Hi @jasinliu, we are about to have the Spark and Java libraries rely on the java-info module (rather than the existing ffi-info or scala-info), which does not need a Spark session: incubator-graphar/maven-projects/info/src/main/java/org/apache/graphar/info/GraphInfo.java, lines 99 to 121 in 270f333
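The point of the java-info module referenced above is that graph metadata can be loaded with plain file IO, with no SparkSession in the picture. A minimal sketch of that idea in Scala — the `GraphInfo` fields and the flat `key: value` parsing here are illustrative, not the actual GraphAr API:

```scala
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for GraphAr's info classes: only two fields,
// just to show that no Spark dependency is involved.
case class GraphInfo(name: String, prefix: String)

object GraphInfoLoader {
  // Reads a simple "key: value" info file from the local filesystem.
  def load(path: String): GraphInfo = {
    val pairs = Files.readAllLines(Paths.get(path)).asScala
      .map(_.trim)
      .filter(l => l.nonEmpty && l.contains(":"))
      .map { l =>
        val idx = l.indexOf(":")
        l.substring(0, idx).trim -> l.substring(idx + 1).trim
      }
      .toMap
    GraphInfo(
      name = pairs.getOrElse("name", ""),
      prefix = pairs.getOrElse("prefix", "")
    )
  }
}
```

The real java-info classes are richer (vertex/edge infos, property groups), but the dependency story is the same: standard library IO only.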
I see. Thanks for the reminder. Now I need this.
Actually, the library is here, but it is still under development (the basic classes are done, but it needs unit tests and a protobuf refactor). I am working on the protobuf refactor in issue #539, since we have decided to unify the info for all libraries via protobuf. For Java/Scala, the ffi-info and scala-info will be deprecated, so I suggest you rely on the API of the java-info module. Besides, maybe we need suggestions from @acezen.
What is the idea behind pushing #561 to main? If for some reason we need a version of the scala library without Spark, it can be done in a separate persistent branch.

Also, I do not fully understand the problem. Under the hood, Spark relies on the

If the problem is dependency hell, why not solve it? There are a lot of tools for that in the JVM world, for example the maven-shade-plugin. #561 will break the PySpark bindings entirely, because at the moment the bindings rely on the

@jasinliu, could you explain a bit more about the motivation behind this? What kind of problem or complications are you facing with Spark for the local FS?
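For reference, dependency conflicts of the kind alluded to above are usually handled in Maven by relocating the conflicting packages at build time. A typical maven-shade-plugin configuration looks roughly like this (the version, relocation pattern, and shaded package name are illustrative, not a tested GraphAr config):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.5.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Example: relocate a dependency that clashes with Spark's copy -->
            <pattern>com.fasterxml.jackson</pattern>
            <shadedPattern>org.apache.graphar.shaded.jackson</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```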
I want to implement the CLI tool based on the Spark dependencies. However, by default, constructing a SparkSession requires starting a Spark instance, which is slow. My current idea is to read the info from the local filesystem and use the SparkSession only to get the data.
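The CLI flow described above can be sketched as: read the graph info eagerly with plain file IO (cheap), and defer the expensive Spark startup until data access is actually needed. In this sketch `sparkStarted` is a placeholder for the real `SparkSession.builder().master("local[*]").getOrCreate()` call, so the snippet stays self-contained:

```scala
import java.nio.file.{Files, Paths}

// Illustrative CLI startup flow, not the actual GraphAr tool.
object GrapharCli {
  // Cheap step: plain local-filesystem IO, no Spark involved.
  def readInfo(path: String): String =
    new String(Files.readAllBytes(Paths.get(path)))

  // Expensive step, deferred: `lazy` means metadata-only commands
  // never pay the Spark/JVM startup cost. In the real tool this
  // would construct a SparkSession instead of returning a flag.
  lazy val sparkStarted: Boolean = {
    // placeholder for SparkSession.builder().master("local[*]").getOrCreate()
    true
  }
}
```

A metadata-only command would call `readInfo` and exit without ever touching `sparkStarted`.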
But
I see. Then I will close this PR. I will implement a version based on the initialization of
Since the
Describe the enhancement requested
Currently, the scala library needs a running Spark process to be used, which is a bit complicated if we only deal with local files.
I hope we can support the local filesystem without a SparkSession. My current idea is to allow the loader and some functions in the scala library to accept None for the spark parameter.
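The proposed signature change could look like the sketch below. `SparkSession` is stubbed with an empty class here so the snippet compiles without Spark on the classpath (the real type is `org.apache.spark.sql.SparkSession`), and `VertexInfoLoader.loadInfo` is an illustrative name, not the actual GraphAr API:

```scala
// Stub so this sketch is self-contained; stands in for
// org.apache.spark.sql.SparkSession.
class SparkSession

object VertexInfoLoader {
  // When `spark` is None, fall back to plain local-filesystem IO;
  // when Some, use Spark's reader as the library does today.
  def loadInfo(path: String, spark: Option[SparkSession] = None): String =
    spark match {
      case Some(_) =>
        // The real library would go through spark.read here.
        s"loaded $path via Spark"
      case None =>
        new String(java.nio.file.Files.readAllBytes(
          java.nio.file.Paths.get(path)))
    }
}
```

Callers with an existing session pass `Some(spark)` and keep the current behavior; local-only tools simply omit the argument.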
Component(s)
Spark