Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation #1178

Open
wants to merge 12 commits into main

Conversation


@grazy27 grazy27 commented Jul 8, 2024

Changes:

  • Implemented compatibility with Spark 3.5.3 (fixes are in a separate commit)
  • Updated project dependencies
  • Updated .NET 6 -> .NET 8 and .NET Framework 4.6.1 -> 4.8
  • Extracted the binary formatter into a separate class and added tests
  • Fixed a number of small bugs, such as null references, Windows paths with whitespace, and issues when running locally and on Databricks
  • Added a documentation page with component and sequence diagrams for .NET for Apache Spark (such diagrams would have helped me significantly, so they are worth adding)
  • Implemented CoGrouped vector UDFs that allow passing two DataFrames to a single UDF (see the sketch after this list)

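A hedged sketch of how the new CoGrouped vector UDF might look from the user's side. The conversation doesn't show the actual API surface, so the `CoGroup` method and the two-`RecordBatch` `Apply` overload below are assumptions modeled on PySpark's `cogroup().applyInPandas()` and on the existing grouped-map vector UDF in Microsoft.Spark; only the Arrow and `StructType` pieces follow the established API.

```csharp
// Hypothetical sketch only: CoGroup and the two-batch Apply overload are assumed
// names, not the confirmed API added by this PR.
using Apache.Arrow;
using Apache.Arrow.Types;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

static class CoGroupedUdfSketch
{
    // `left` and `right` are any two DataFrames that share a "key" column.
    public static DataFrame CountPerKey(DataFrame left, DataFrame right)
    {
        return left
            .GroupBy("key")
            .CoGroup(right.GroupBy("key"))   // assumed method added by this PR
            .Apply(                          // assumed overload taking two RecordBatches
                new StructType(new[]
                {
                    new StructField("left_count", new IntegerType()),
                    new StructField("right_count", new IntegerType())
                }),
                (RecordBatch l, RecordBatch r) =>
                {
                    // Each side of the co-group arrives as one Arrow RecordBatch;
                    // return a single batch with the per-group result.
                    Schema schema = new Schema.Builder()
                        .Field(f => f.Name("left_count").DataType(Int32Type.Default))
                        .Field(f => f.Name("right_count").DataType(Int32Type.Default))
                        .Build();
                    IArrowArray[] columns =
                    {
                        new Int32Array.Builder().Append(l.Length).Build(),
                        new Int32Array.Builder().Append(r.Length).Build()
                    };
                    return new RecordBatch(schema, columns, 1);
                });
    }
}
```
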
Tested with:

Spark:

On every stop there's an exception that doesn't affect execution:
ERROR DotnetBackendHandler: Exception caught: java.net.SocketException: Connection reset at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)

  • Works with local 3.5.1
  • Works with local 3.5.2 (if the 'IgnorePathVersion...' setting is enabled in Scala)

Databricks:

  • Fails on 15.4:
[Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null], )
[2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
	at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
	at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
  • Works on 14.3:
    On Databricks, UseArrow is always true, and vector UDFs don't work because Spark splits the record batch into a collection of batches while the code expects a single batch (see the sketch below).

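To illustrate the mismatch (a sketch, not code from this PR or from Microsoft.Spark.Worker): the group is delivered as a collection of Arrow record batches, modeled here as an `IEnumerable<RecordBatch>`, while the current UDF path expects a single `RecordBatch`. One way to cope is to apply the UDF per batch, which is only safe for element-wise logic; anything that aggregates across the whole group would need the batches concatenated first. The helper name below is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Apache.Arrow;

static class BatchShim
{
    public static IEnumerable<RecordBatch> ApplyPerBatch(
        IEnumerable<RecordBatch> groupBatches,          // what Databricks hands over
        Func<RecordBatch, RecordBatch> singleBatchUdf)  // what the current code expects
    {
        foreach (RecordBatch batch in groupBatches)
        {
            // Valid only when the UDF doesn't aggregate across the whole group;
            // otherwise the batches must be merged before a single call.
            yield return singleBatchUdf(batch);
        }
    }
}
```
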
Affected tickets:

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and documentation Spark 3.5.1, .NET 8, Dependencies and Documentation Jul 8, 2024
@grazy27
Author

grazy27 commented Jul 8, 2024

@dotnet-policy-service agree

@GeorgeS2019

GeorgeS2019 commented Jul 22, 2024

@grazy27

Can you share how many of the unit tests pass?

The UDF unit tests have not been updated.
Are you able to get all of them to pass?

[screenshot]

@grazy27
Author

grazy27 commented Jul 22, 2024

> @grazy27
>
> Can you share how many of the unit tests pass?
>
> The UDF unit tests have not been updated. Are you able to get all of them to pass?
>
> [screenshot]

Hello @GeorgeS2019, they do.
[screenshot]

Saw your issue; my environment probably uses UTF-8 by default.
Several tests fail from time to time with (executor driver): java.nio.file.NoSuchFileException: C:\Users\grazy27\AppData\Local\Temp\spark-cc2cf7bc-3c8c-4fdf-a496-266424de943d\userFiles-92d122bb-af9a-40ea-a430-131454afc705\archive.zip
But they pass if run a second time, so I didn't dig deeper.

@travis-leith

What is the status of this PR?

@grazy27
Author

grazy27 commented Aug 26, 2024

> What is the status of this PR?

It works, the tests pass, and performance-wise, it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side.

I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well

@GeorgeS2019

GeorgeS2019 commented Aug 26, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs after making the adjustments to migrate to .NET 6.

#796

[screenshot]
https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

@travis-leith

> The next steps are on Microsoft's side.

Any idea who is "in charge" of this repo?

@grazy27
Author

grazy27 commented Aug 26, 2024

> @grazy27
>
> Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
>
> Previously we all had problems with UDFs.
>
> #796

I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus.

There are two suggestions from developers that might help. The first is for a separate code cell, and the second is for a separate environment variable. Have you tried both approaches, and does the issue still persist?

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and Documentation Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation Nov 23, 2024
@grazy27 grazy27 mentioned this pull request Nov 23, 2024