
feat(spark): Support local filesystem without sparksession #554

Closed
jasinliu opened this issue Jul 27, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@jasinliu
Contributor

jasinliu commented Jul 27, 2024

Describe the enhancement requested

Currently, the Scala library needs a running Spark process to be used. But that is unnecessarily complicated if we only deal with local files.

I hope we can support the local filesystem without a SparkSession. My current idea is to let the loader and some functions in the Scala library accept None for the spark parameter:

```scala
def loadGraphInfo(
    graphInfoPath: String,
    spark: Option[SparkSession] = None
)
```
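For illustration, the same Option-with-default idea maps to Java roughly like this (a minimal sketch only; `Session` and `loadGraphInfo` here are hypothetical placeholders, not Spark's or GraphAr's real API):

```java
import java.util.Optional;

// Hypothetical stand-in for a session type; NOT Spark's SparkSession.
class Session {}

public class LoaderSketch {
    // Mirrors `spark: Option[SparkSession] = None`: callers that only touch
    // local files pass Optional.empty() and no session is ever created.
    static String loadGraphInfo(String graphInfoPath, Optional<Session> spark) {
        if (spark.isEmpty()) {
            return "loaded " + graphInfoPath + " without a session";
        }
        return "loaded " + graphInfoPath + " via session";
    }

    public static void main(String[] args) {
        System.out.println(loadGraphInfo("/tmp/graph.yml", Optional.empty()));
    }
}
```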

Component(s)

Spark

@jasinliu jasinliu added the enhancement New feature or request label Jul 27, 2024
@jasinliu
Contributor Author

There is a serious problem here that is difficult to avoid.

GraphAr-commons relies on Spark's DataFrame, and the implementation of DataFrame requires a SparkSession, so we cannot construct a DataFrame directly without a SparkSession.

@Thespica
Contributor

Hi @jasinliu

We are about to make the Spark and Java libraries rely on the java-info module (rather than the existing ffi-info or scala-info), which does not need a SparkSession:

```java
public static GraphInfo load(String graphPath) throws IOException {
    return load(graphPath, new Configuration());
}

public static GraphInfo load(String graphPath, FileSystem fileSystem) throws IOException {
    if (fileSystem == null) {
        throw new IllegalArgumentException("FileSystem is null");
    }
    return load(graphPath, fileSystem.getConf());
}

public static GraphInfo load(String graphPath, Configuration conf) throws IOException {
    if (conf == null) {
        throw new IllegalArgumentException("Configuration is null");
    }
    Path path = new Path(graphPath);
    FileSystem fileSystem = path.getFileSystem(conf);
    FSDataInputStream inputStream = fileSystem.open(path);
    Yaml graphYamlLoader =
            new Yaml(new Constructor(GraphYamlParser.class, new LoaderOptions()));
    GraphYamlParser graphYaml = graphYamlLoader.load(inputStream);
    return new GraphInfo(graphYaml, conf);
}
```
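For comparison, the overload-chaining and local-file reading above can be exercised with no Spark or Hadoop dependency at all, using only the JDK (a standalone sketch, not GraphAr's actual API — the `load` helpers here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalLoadSketch {
    // Overload without a config: falls through to the full variant,
    // mirroring the load(String) -> load(String, Configuration) chain.
    static String load(String graphPath) throws IOException {
        return load(graphPath, "defaults");
    }

    static String load(String graphPath, String conf) throws IOException {
        if (conf == null) {
            throw new IllegalArgumentException("Configuration is null");
        }
        // Plain local-filesystem read; no session or cluster involved.
        return Files.readString(Path.of(graphPath));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("graph", ".yml");
        Files.writeString(tmp, "name: demo");
        System.out.println(load(tmp.toString()));
    }
}
```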

@jasinliu
Contributor Author

> Hi @jasinliu
>
> We are about to make the Spark and Java libraries rely on the java-info module (rather than the existing ffi-info or scala-info), which does not need a SparkSession: […]

I see. Thanks for the reminder. Now I need this info library. Let me see if there's anything I can do to help.

@Thespica
Contributor

@jasinliu

Actually, the library is here, but it is still under development (the basic classes are done, but it needs unit tests and a refactor to protobuf). I am working on the protobuf refactor in issue #539.

Since we have decided to unify the info classes for all libraries via protobuf, the ffi-info and scala-info modules will be deprecated for Java/Scala. So I suggest you rely on the API of the java-info module.

Besides, maybe we need a suggestion from @acezen.

@SemyonSinchenko
Member

What is the idea behind pushing #561 to main? If for some reason we need a version of the Scala library without Spark, it can be done in a separate persistent branch. Also, I do not fully understand the problem. Under the hood, Spark relies on org.apache.hadoop.fs.FileSystem, which also has an implementation for the local filesystem.

If the problem is dependency hell, why not solve it directly? There are a lot of tools for that in the JVM world, for example, the maven-shade-plugin.
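For reference, a minimal maven-shade-plugin setup with a package relocation looks roughly like this (a sketch only; the version number and relocation patterns are placeholders, not GraphAr's actual build configuration):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.5.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Move a conflicting dependency into a private namespace
               so it cannot clash with the version Spark ships. -->
          <relocation>
            <pattern>com.example.conflicting</pattern>
            <shadedPattern>org.apache.graphar.shaded.com.example.conflicting</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```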

#561 will totally break the PySpark bindings, because at the moment the bindings rely on pyspark.sql.SparkSession._jvm. If we merge it, we will have constantly failing CI pipelines.

@jasinliu Could you explain the motivation behind it a bit more? What kind of problem or complications are you facing with Spark for the local FS?

@jasinliu
Contributor Author

> What is the idea behind pushing #561 to main? […]
>
> @jasinliu Could you explain the motivation behind it a bit more? What kind of problem or complications are you facing with Spark for the local FS?

I want to implement the CLI tool on top of the Spark dependencies. However, by default, constructing a SparkSession requires starting a Spark instance, which is slow. My current idea is to read the info files from the local filesystem directly and only use a SparkSession to get the data.

@SemyonSinchenko
Member

But SparkSession is a singleton by design, so if you will need it sooner or later in the code, you should initialize it, and you should initialize it only once.
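That "create once, reuse everywhere" behavior is what SparkSession.builder().getOrCreate() provides. The pattern can be sketched in plain Java (a minimal sketch; `SessionSingleton` is a hypothetical stand-in, not Spark's actual builder API):

```java
// Minimal sketch of the getOrCreate singleton pattern that
// SparkSession.builder().getOrCreate() follows. `SessionSingleton`
// is a hypothetical stand-in class, not Spark's actual API.
final class SessionSingleton {
    private static volatile SessionSingleton active;

    private SessionSingleton() {}

    // Returns the existing instance if one was already created,
    // otherwise creates it exactly once (double-checked locking).
    static SessionSingleton getOrCreate() {
        if (active == null) {
            synchronized (SessionSingleton.class) {
                if (active == null) {
                    active = new SessionSingleton();
                }
            }
        }
        return active;
    }
}

public class SingletonDemo {
    public static void main(String[] args) {
        SessionSingleton a = SessionSingleton.getOrCreate();
        SessionSingleton b = SessionSingleton.getOrCreate();
        // Both calls return the same instance.
        System.out.println(a == b); // true
    }
}
```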

@jasinliu
Contributor Author

> But SparkSession is a singleton by design, so if you will need it sooner or later in the code, you should initialize it, and you should initialize it only once.

I see. Then I will close this PR. I will implement a version that initializes a SparkSession first, and then improve it after the java-info module becomes independent. Thank you very much~

@acezen
Contributor

acezen commented Jul 30, 2024

> > What is the idea behind pushing #561 to main? […]
>
> I want to implement the CLI tool on top of the Spark dependencies. However, by default, constructing a SparkSession requires starting a Spark instance, which is slow. My current idea is to read the info files from the local filesystem directly and only use a SparkSession to get the data.

Since the info module will be unified in the future without binding to Spark, I think this is not a concern.
