
feat(spark): Support local filesystem without sparksession #554

Closed
jasinliu opened this issue Jul 27, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@jasinliu
Contributor

jasinliu commented Jul 27, 2024

Describe the enhancement requested

Currently, the Scala library needs a running Spark process to be used. But that is unnecessarily complicated if we only deal with local files.

I hope we can support the local filesystem without a SparkSession. My current idea is to let the loader and some functions in the Scala library accept None for the spark parameter:

```scala
def loadGraphInfo(
    graphInfoPath: String,
    spark: Option[SparkSession] = None
)
```
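For illustration, the same Option-with-default idea maps to Java roughly like this (a minimal sketch only; `Session` and `loadGraphInfo` here are hypothetical placeholders, not Spark's or GraphAr's real API):

```java
import java.util.Optional;

// Hypothetical stand-in for a session type; NOT Spark's SparkSession.
class Session {}

public class LoaderSketch {
    // Mirrors `spark: Option[SparkSession] = None`: callers that only touch
    // local files pass Optional.empty() and no session is ever created.
    static String loadGraphInfo(String graphInfoPath, Optional<Session> spark) {
        if (spark.isEmpty()) {
            return "loaded " + graphInfoPath + " without a session";
        }
        return "loaded " + graphInfoPath + " via session";
    }

    public static void main(String[] args) {
        System.out.println(loadGraphInfo("/tmp/graph.yml", Optional.empty()));
    }
}
```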

Component(s)

Spark

@jasinliu jasinliu added the enhancement New feature or request label Jul 27, 2024
@jasinliu
Contributor Author

There is a serious problem here that is difficult to avoid.

GraphAr-commons relies on Spark's DataFrame, and the implementation of DataFrame requires a SparkSession, so we cannot construct a DataFrame directly without a SparkSession.

@Thespica
Contributor

Hi @jasinliu

We are about to make the Spark and Java libraries rely on the java-info module (rather than the existing ffi-info or scala-info), which does not need a SparkSession:

```java
public static GraphInfo load(String graphPath) throws IOException {
    return load(graphPath, new Configuration());
}

public static GraphInfo load(String graphPath, FileSystem fileSystem) throws IOException {
    if (fileSystem == null) {
        throw new IllegalArgumentException("FileSystem is null");
    }
    return load(graphPath, fileSystem.getConf());
}

public static GraphInfo load(String graphPath, Configuration conf) throws IOException {
    if (conf == null) {
        throw new IllegalArgumentException("Configuration is null");
    }
    Path path = new Path(graphPath);
    FileSystem fileSystem = path.getFileSystem(conf);
    FSDataInputStream inputStream = fileSystem.open(path);
    Yaml graphYamlLoader =
            new Yaml(new Constructor(GraphYamlParser.class, new LoaderOptions()));
    GraphYamlParser graphYaml = graphYamlLoader.load(inputStream);
    return new GraphInfo(graphYaml, conf);
}
```
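For comparison, the overload-chaining and local-file reading above can be exercised with no Spark or Hadoop dependency at all, using only the JDK (a standalone sketch, not GraphAr's actual API — the `load` helpers here are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalLoadSketch {
    // Overload without a config: falls through to the full variant,
    // mirroring the load(String) -> load(String, Configuration) chain.
    static String load(String graphPath) throws IOException {
        return load(graphPath, "defaults");
    }

    static String load(String graphPath, String conf) throws IOException {
        if (conf == null) {
            throw new IllegalArgumentException("Configuration is null");
        }
        // Plain local-filesystem read; no session or cluster involved.
        return Files.readString(Path.of(graphPath));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("graph", ".yml");
        Files.writeString(tmp, "name: demo");
        System.out.println(load(tmp.toString()));
    }
}
```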

@jasinliu
Contributor Author

> Hi @jasinliu
>
> We are about to make the Spark and Java libraries rely on the java-info module (rather than the existing ffi-info or scala-info), which does not need a SparkSession: […]

I see. Thanks for the reminder. Now I need this info library. Let me see if there's anything I can do to help.

@Thespica
Contributor

@jasinliu

Actually, the library is here, but it is still under development (the basic classes are done, but it needs unit tests and a refactor to protobuf). I am working on the protobuf refactor in issue #539.

Since we have decided to unify the info classes for all libraries via protobuf, the ffi-info and scala-info modules will be deprecated for Java/Scala. So I suggest you rely on the API of the java-info module.

Besides, maybe we need a suggestion from @acezen.

@SemyonSinchenko
Member

What is the idea behind pushing #561 to main? If for some reason we need a version of the Scala library without Spark, it can be done in a separate persistent branch. Also, I do not fully understand the problem. Under the hood, Spark relies on org.apache.hadoop.fs.FileSystem, which also has an implementation for the local filesystem.

If the problem is dependency hell, why not solve it directly? There are a lot of tools for that in the JVM world, for example, the maven-shade-plugin.
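For reference, a minimal maven-shade-plugin setup with a package relocation looks roughly like this (a sketch only; the version number and relocation patterns are placeholders, not GraphAr's actual build configuration):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.5.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <!-- Move a conflicting dependency into a private namespace
               so it cannot clash with the version Spark ships. -->
          <relocation>
            <pattern>com.example.conflicting</pattern>
            <shadedPattern>org.apache.graphar.shaded.com.example.conflicting</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```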

#561 will totally break the PySpark bindings, because at the moment the bindings rely on pyspark.sql.SparkSession._jvm. If we merge it, we will have constantly failing CI pipelines.

@jasinliu Could you explain the motivation behind it a bit more? What kind of problem or complications are you facing with Spark for the local FS?

@jasinliu
Contributor Author

> What is the idea behind pushing #561 to main? […]
>
> @jasinliu Could you explain the motivation behind it a bit more? What kind of problem or complications are you facing with Spark for the local FS?

I want to implement the CLI tool on top of the Spark dependencies. However, by default, constructing a SparkSession requires starting a Spark instance, which is slow. My current idea is to read the info files from the local filesystem directly and only use a SparkSession to get the data.

@SemyonSinchenko
Member

But SparkSession is a singleton by design, so if you will need it sooner or later in the code, you should initialize it, and you should initialize it only once.
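That "create once, reuse everywhere" behavior is what SparkSession.builder().getOrCreate() provides. The pattern can be sketched in plain Java (a minimal sketch; `SessionSingleton` is a hypothetical stand-in, not Spark's actual builder API):

```java
// Minimal sketch of the getOrCreate singleton pattern that
// SparkSession.builder().getOrCreate() follows. `SessionSingleton`
// is a hypothetical stand-in class, not Spark's actual API.
final class SessionSingleton {
    private static volatile SessionSingleton active;

    private SessionSingleton() {}

    // Returns the existing instance if one was already created,
    // otherwise creates it exactly once (double-checked locking).
    static SessionSingleton getOrCreate() {
        if (active == null) {
            synchronized (SessionSingleton.class) {
                if (active == null) {
                    active = new SessionSingleton();
                }
            }
        }
        return active;
    }
}

public class SingletonDemo {
    public static void main(String[] args) {
        SessionSingleton a = SessionSingleton.getOrCreate();
        SessionSingleton b = SessionSingleton.getOrCreate();
        // Both calls return the same instance.
        System.out.println(a == b); // true
    }
}
```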

@jasinliu
Contributor Author

> But SparkSession is a singleton by design, so if you will need it sooner or later in the code, you should initialize it, and you should initialize it only once.

I see. Then I will close this PR. I will implement a version that initializes a SparkSession first, and then improve it after the java-info module becomes independent. Thank you very much~

@acezen
Contributor

acezen commented Jul 30, 2024

> > What is the idea behind pushing #561 to main? […]
>
> I want to implement the CLI tool on top of the Spark dependencies. However, by default, constructing a SparkSession requires starting a Spark instance, which is slow. My current idea is to read the info files from the local filesystem directly and only use a SparkSession to get the data.

Since the info module will be unified in the future without binding to Spark, I think this is not a concern.
