Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation #1178

Open
wants to merge 12 commits into main

Conversation


@grazy27 grazy27 commented Jul 8, 2024

Changes:

  • Implemented compatibility with Spark 3.5.3 (fixes are in a separate commit)
  • Updated project dependencies
  • Updated .NET 6 -> .NET 8 and .NET Framework 4.6.1 -> 4.8
  • Extracted the binary formatter into a separate class and added tests
  • Fixed a number of small bugs, such as null references, Windows paths with whitespace, and issues when running locally and on Databricks
  • Added a documentation page with component and sequence diagrams for .NET for Apache Spark (such diagrams would have helped me significantly, so they are worth adding)
  • Implemented CoGrouped vector UDFs that allow passing two DataFrames to a single UDF (see the sketch after this list)

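A hedged sketch of how the new CoGrouped vector UDF might look from the user's side. The conversation doesn't show the actual API surface, so the `CoGroup` method and the two-`RecordBatch` `Apply` overload below are assumptions modeled on PySpark's `cogroup().applyInPandas()` and on the existing grouped-map vector UDF in Microsoft.Spark; only the Arrow and `StructType` pieces follow the established API.

```csharp
// Hypothetical sketch only: CoGroup and the two-batch Apply overload are assumed
// names, not the confirmed API added by this PR.
using Apache.Arrow;
using Apache.Arrow.Types;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

static class CoGroupedUdfSketch
{
    // `left` and `right` are any two DataFrames that share a "key" column.
    public static DataFrame CountPerKey(DataFrame left, DataFrame right)
    {
        return left
            .GroupBy("key")
            .CoGroup(right.GroupBy("key"))   // assumed method added by this PR
            .Apply(                          // assumed overload taking two RecordBatches
                new StructType(new[]
                {
                    new StructField("left_count", new IntegerType()),
                    new StructField("right_count", new IntegerType())
                }),
                (RecordBatch l, RecordBatch r) =>
                {
                    // Each side of the co-group arrives as one Arrow RecordBatch;
                    // return a single batch with the per-group result.
                    Schema schema = new Schema.Builder()
                        .Field(f => f.Name("left_count").DataType(Int32Type.Default))
                        .Field(f => f.Name("right_count").DataType(Int32Type.Default))
                        .Build();
                    IArrowArray[] columns =
                    {
                        new Int32Array.Builder().Append(l.Length).Build(),
                        new Int32Array.Builder().Append(r.Length).Build()
                    };
                    return new RecordBatch(schema, columns, 1);
                });
    }
}
```
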
Tested with:

Spark:

On every stop there's an exception that doesn't affect execution:
ERROR DotnetBackendHandler: Exception caught: java.net.SocketException: Connection reset at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)

  • Works with local 3.5.1
  • Works with local 3.5.2 (if the 'IgnorePathVersion...' setting is enabled in Scala)

Databricks:

  • Fails on 15.4:
[Error] [JvmBridge] JVM method execution failed: Static method 'createPythonFunction' failed for class 'org.apache.spark.sql.api.dotnet.SQLUtils' when called with 7 arguments ([Index=1, Type=Byte[], Value=System.Byte[]], [Index=2, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], [Index=3, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=4, Type=String, Value=Microsoft.Spark.Worker], [Index=5, Type=String, Value=2.1.1.0], [Index=6, Type=ArrayList, Value=Microsoft.Spark.Interop.Internal.Java.Util.ArrayList], [Index=7, Type=null, Value=null], )
[2024-09-13T10:47:53.1569404Z] [machine] [Error] [JvmBridge] java.lang.NoSuchMethodError: org.apache.spark.api.python.SimplePythonFunction.<init>(Lscala/collection/Seq;Ljava/util/Map;Ljava/util/List;Ljava/lang/String;Ljava/lang/String;Ljava/util/List;Lorg/apache/spark/api/python/PythonAccumulatorV2;)V
	at org.apache.spark.sql.api.dotnet.SQLUtils$.createPythonFunction(SQLUtils.scala:35)
	at org.apache.spark.sql.api.dotnet.SQLUtils.createPythonFunction(SQLUtils.scala)
  • Works on 14.3:
    On Databricks, UseArrow is always true, and vector UDFs don't work because Spark splits the record batch into a collection of batches while the code expects a single batch (see the sketch below).

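To illustrate the mismatch (a sketch, not code from this PR or from Microsoft.Spark.Worker): the group is delivered as a collection of Arrow record batches, modeled here as an `IEnumerable<RecordBatch>`, while the current UDF path expects a single `RecordBatch`. One way to cope is to apply the UDF per batch, which is only safe for element-wise logic; anything that aggregates across the whole group would need the batches concatenated first. The helper name below is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using Apache.Arrow;

static class BatchShim
{
    public static IEnumerable<RecordBatch> ApplyPerBatch(
        IEnumerable<RecordBatch> groupBatches,          // what Databricks hands over
        Func<RecordBatch, RecordBatch> singleBatchUdf)  // what the current code expects
    {
        foreach (RecordBatch batch in groupBatches)
        {
            // Valid only when the UDF doesn't aggregate across the whole group;
            // otherwise the batches must be merged before a single call.
            yield return singleBatchUdf(batch);
        }
    }
}
```
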
Affected tickets:

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and documentation Spark 3.5.1, .NET 8, Dependencies and Documentation Jul 8, 2024
@grazy27
Author

grazy27 commented Jul 8, 2024

@dotnet-policy-service agree

@GeorgeS2019

GeorgeS2019 commented Jul 22, 2024

@grazy27

Can you share how many of the unit tests pass?

The UDF unit tests have not been updated.
Are you able to get all of them to pass?

[screenshot]

@grazy27
Author

grazy27 commented Jul 22, 2024

> @grazy27
>
> Can you share how many of the unit tests pass?
>
> The UDF unit tests have not been updated. Are you able to get all of them to pass?
>
> [screenshot]

Hello @GeorgeS2019, they do.
[screenshot]

Saw your issue; my environment probably uses UTF-8 by default.
Several tests fail from time to time with (executor driver): java.nio.file.NoSuchFileException: C:\Users\grazy27\AppData\Local\Temp\spark-cc2cf7bc-3c8c-4fdf-a496-266424de943d\userFiles-92d122bb-af9a-40ea-a430-131454afc705\archive.zip
But they pass if run a second time, so I didn't dig deeper.

@travis-leith

What is the status of this PR?

@grazy27
Author

grazy27 commented Aug 26, 2024

> What is the status of this PR?

It works, the tests pass, and performance-wise, it's the best solution I've found for integrating .NET with Spark. The next steps are on Microsoft's side.

I'm also working on implementing CoGrouped UDFs, and I plan to push those updates here as well

@GeorgeS2019

GeorgeS2019 commented Aug 26, 2024

@grazy27

Can you investigate whether your solution works in a polyglot .NET Interactive notebook?

Previously we all had problems with UDFs after making the adjustments to migrate to .NET 6.

#796

[screenshot]
https://github.com/Apress/introducing-.net-for-apache-spark/tree/main/ch04/Chapter4

@travis-leith

> The next steps are on Microsoft's side.

Any idea who is "in charge" of this repo?

@grazy27
Author

grazy27 commented Aug 26, 2024

> @grazy27
>
> Can you investigate whether your solution works in a polyglot .NET Interactive notebook?
>
> Previously we all had problems with UDFs.
>
> #796

I can take a look, but only if a lonely evening with bad weather rolls around :) No promises, as this isn’t my primary focus.

There are two suggestions from developers that might help. The first is for a separate code cell, and the second is for a separate environment variable. Have you tried both approaches, and does the issue still persist?

@grazy27 grazy27 changed the title Spark 3.5.1, .NET 8, Dependencies and Documentation Spark 3.5.3, .NET 8, CoGrouped UDFs, Fixes, Dependencies and Documentation Nov 23, 2024
@grazy27 grazy27 mentioned this pull request Nov 23, 2024