
[spark] Support spark 4.0 #4325

Merged: 9 commits into apache:master on Nov 11, 2024
Conversation

@Zouxxyy (Contributor) commented Oct 15, 2024

Purpose

Related to #3940: support Spark 4.0. The module relationships are shown below.

[image: module relationship diagram]

Tests

API and Format

Documentation

@Zouxxyy force-pushed the dev/spark4.0-1 branch 4 times, most recently from 5f291ae to a122bfd, on October 17, 2024 01:56
@Zouxxyy changed the title from [WIP][spark] Support spark 4.0 to [spark] Support spark 4.0 on Oct 17, 2024
pom.xml (outdated, resolved):
<scala.version>${scala212.version}</scala.version>
<paimon-spark-common.spark.version>3.5.3</paimon-spark-common.spark.version>
<paimon-spark-x.x.common>paimon-spark-3.x-common</paimon-spark-x.x.common>
<test.spark.main.version>3.3</test.spark.main.version>
Contributor:

Should this be test.spark.version? And why test only Spark 3.3?

Contributor Author (@Zouxxyy):

Because Paimon currently uses https://github.com/big-data-europe/docker-spark to run the e2e tests, and it only supports up to Spark 3.3.0. I have added a TODO here.

@@ -122,7 +122,11 @@ class BucketedTableQueryTest extends PaimonSparkTestBase with AdaptiveSparkPlanH
spark.sql(
"CREATE TABLE t5 (id INT, c STRING) TBLPROPERTIES ('primary-key' = 'id', 'bucket'='10')")
spark.sql("INSERT INTO t5 VALUES (1, 'x1')")
checkAnswerAndShuffleSorts("SELECT * FROM t1 JOIN t5 on t1.id = t5.id", 2, 2)
if (gteqSpark4_0) {
checkAnswerAndShuffleSorts("SELECT * FROM t1 JOIN t5 on t1.id = t5.id", 0, 0)
Contributor:

Please put the JIRA reference here.

Contributor Author (@Zouxxyy):

This should be an optimization introduced in Spark 4.0. I looked for it, but couldn't find the specific PR.
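For context on what this test asserts: checkAnswerAndShuffleSorts runs a query and checks how many shuffle exchanges and sorts appear in the physical plan, and gteqSpark4_0 gates the expected counts on the runtime Spark version. A minimal sketch of how such helpers could look; the names, signatures, and details below are assumptions inferred from the test, not the PR's actual code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SortExec
import org.apache.spark.sql.execution.exchange.ShuffleExchangeLike

// Sketch of a version gate like the test's `gteqSpark4_0`.
val Array(sparkMajor, sparkMinor) =
  org.apache.spark.SPARK_VERSION.split("\\.").take(2).map(_.toInt)
val gteqSpark4_0: Boolean =
  sparkMajor > 4 || (sparkMajor == 4 && sparkMinor >= 0)

// Sketch of the assertion: count shuffles and sorts in the executed plan.
// (The real test class mixes in AdaptiveSparkPlanHelper to look inside the
// AQE plan; this simplified version ignores that detail.)
def checkShuffleSorts(
    spark: SparkSession,
    query: String,
    expectedShuffles: Int,
    expectedSorts: Int): Unit = {
  val df = spark.sql(query)
  df.collect() // materialize the result so the executed plan is final
  val plan = df.queryExecution.executedPlan
  val shuffles = plan.collect { case s: ShuffleExchangeLike => s }.size
  val sorts = plan.collect { case s: SortExec => s }.size
  assert(shuffles == expectedShuffles, s"expected $expectedShuffles shuffles, got $shuffles")
  assert(sorts == expectedSorts, s"expected $expectedSorts sorts, got $sorts")
}

Under Spark 4.0 the bucketed join between t1 and t5 no longer needs any shuffle or sort, so the expected counts drop from (2, 2) to (0, 0), which is what the version-gated branch above encodes.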

@YannByron (Contributor):

Please address the comments.

<scala.binary.version>2.12</scala.binary.version>
<scala.version>${scala212.version}</scala.version>
<paimon-spark-common.spark.version>3.5.3</paimon-spark-common.spark.version>
<paimon-sparkx-common>paimon-spark3-common</paimon-sparkx-common>
Contributor:

I prefer paimon-spark-common. cc @JingsongLi

Contributor Author (@Zouxxyy):

paimon-spark-common may cause ambiguity with the existing paimon-spark-common module; I prefer to keep them distinct.

Contributor:

I am OK with paimon-sparkx-common.

@@ -152,10 +156,10 @@ class BucketedTableQueryTest extends PaimonSparkTestBase with AdaptiveSparkPlanH
checkAnswerAndShuffleSorts("SELECT id, max(c) FROM t1 GROUP BY id", 0, 0)
checkAnswerAndShuffleSorts("SELECT c, count(*) FROM t1 GROUP BY c", 1, 0)
checkAnswerAndShuffleSorts("SELECT c, max(c) FROM t1 GROUP BY c", 1, 2)
checkAnswerAndShuffleSorts("select sum(c) OVER (PARTITION BY id ORDER BY c) from t1", 0, 1)
Contributor:

Why change this test?

Contributor Author (@Zouxxyy):

Because Spark 4.0 no longer supports computing sum directly over strings. This looks like a typo in the original test, so I changed it to max, matching the case above.
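Concretely, using t1 from the test above (where c is a STRING column), the difference looks like this. A small illustration based on the author's explanation, not code from the PR:

// Spark < 4.0 implicitly casts the string column, so SUM "works";
// Spark 4.0 rejects SUM over a STRING column instead.
spark.sql("SELECT sum(c) OVER (PARTITION BY id ORDER BY c) FROM t1") // fails on 4.0
// MAX is defined for strings on all versions, so the test uses it instead.
spark.sql("SELECT max(c) OVER (PARTITION BY id ORDER BY c) FROM t1") // works on 3.x and 4.0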

@YannByron (Contributor):

LGTM. Left two comments. cc @JingsongLi

@@ -34,10 +34,22 @@ under the License.
<name>Paimon : Spark : Common</name>
Contributor:

We should add the Scala version here, because this paimon-spark-common will now produce two artifacts, so we need to distinguish them.

Contributor Author (@Zouxxyy):

Addressed.

@JingsongLi (Contributor) left a review:

+1

@YannByron merged commit a389af6 into apache:master on Nov 11, 2024 (12 checks passed).
@Zouxxyy deleted the dev/spark4.0-1 branch on November 13, 2024.