
[spark] Support spark 4.0 #4325

Merged: 9 commits into apache:master on Nov 11, 2024
Conversation

@Zouxxyy (Contributor) commented Oct 15, 2024

Purpose

Related to #3940: support Spark 4.0. The module relationships are shown below.

[image: module relationship diagram]

Tests

API and Format

Documentation

@Zouxxyy force-pushed the dev/spark4.0-1 branch 4 times, most recently from 5f291ae to a122bfd, on October 17, 2024 01:56
@Zouxxyy changed the title from [WIP][spark] Support spark 4.0 to [spark] Support spark 4.0 on Oct 17, 2024
pom.xml (outdated, resolved):
<scala.version>${scala212.version}</scala.version>
<paimon-spark-common.spark.version>3.5.3</paimon-spark-common.spark.version>
<paimon-spark-x.x.common>paimon-spark-3.x-common</paimon-spark-x.x.common>
<test.spark.main.version>3.3</test.spark.main.version>
Contributor:

Should this be test.spark.version? And why test only Spark 3.3?

Contributor Author (@Zouxxyy):

Because Paimon currently uses https://github.com/big-data-europe/docker-spark to run the e2e tests, and it only supports up to Spark 3.3.0. I have added a TODO here.

@@ -122,7 +122,11 @@ class BucketedTableQueryTest extends PaimonSparkTestBase with AdaptiveSparkPlanH
spark.sql(
"CREATE TABLE t5 (id INT, c STRING) TBLPROPERTIES ('primary-key' = 'id', 'bucket'='10')")
spark.sql("INSERT INTO t5 VALUES (1, 'x1')")
checkAnswerAndShuffleSorts("SELECT * FROM t1 JOIN t5 on t1.id = t5.id", 2, 2)
if (gteqSpark4_0) {
checkAnswerAndShuffleSorts("SELECT * FROM t1 JOIN t5 on t1.id = t5.id", 0, 0)
Contributor:

Please put the JIRA reference here.

Contributor Author (@Zouxxyy):

This should be an optimization introduced in Spark 4.0. I looked for it, but couldn't find the specific PR.
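For context on what this test asserts: checkAnswerAndShuffleSorts runs a query and checks how many shuffle exchanges and sorts appear in the physical plan, and gteqSpark4_0 gates the expected counts on the runtime Spark version. A minimal sketch of how such helpers could look; the names, signatures, and details below are assumptions inferred from the test, not the PR's actual code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SortExec
import org.apache.spark.sql.execution.exchange.ShuffleExchangeLike

// Sketch of a version gate like the test's `gteqSpark4_0`.
val Array(sparkMajor, sparkMinor) =
  org.apache.spark.SPARK_VERSION.split("\\.").take(2).map(_.toInt)
val gteqSpark4_0: Boolean =
  sparkMajor > 4 || (sparkMajor == 4 && sparkMinor >= 0)

// Sketch of the assertion: count shuffles and sorts in the executed plan.
// (The real test class mixes in AdaptiveSparkPlanHelper to look inside the
// AQE plan; this simplified version ignores that detail.)
def checkShuffleSorts(
    spark: SparkSession,
    query: String,
    expectedShuffles: Int,
    expectedSorts: Int): Unit = {
  val df = spark.sql(query)
  df.collect() // materialize the result so the executed plan is final
  val plan = df.queryExecution.executedPlan
  val shuffles = plan.collect { case s: ShuffleExchangeLike => s }.size
  val sorts = plan.collect { case s: SortExec => s }.size
  assert(shuffles == expectedShuffles, s"expected $expectedShuffles shuffles, got $shuffles")
  assert(sorts == expectedSorts, s"expected $expectedSorts sorts, got $sorts")
}

Under Spark 4.0 the bucketed join between t1 and t5 no longer needs any shuffle or sort, so the expected counts drop from (2, 2) to (0, 0), which is what the version-gated branch above encodes.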

@YannByron (Contributor):

Please address the comments.

<scala.binary.version>2.12</scala.binary.version>
<scala.version>${scala212.version}</scala.version>
<paimon-spark-common.spark.version>3.5.3</paimon-spark-common.spark.version>
<paimon-sparkx-common>paimon-spark3-common</paimon-sparkx-common>
Contributor:

I prefer paimon-spark-common. cc @JingsongLi

Contributor Author (@Zouxxyy):

paimon-spark-common may cause ambiguity with the existing paimon-spark-common module; I prefer to keep them distinct.

Contributor:

I am OK with paimon-sparkx-common.

@@ -152,10 +156,10 @@ class BucketedTableQueryTest extends PaimonSparkTestBase with AdaptiveSparkPlanH
checkAnswerAndShuffleSorts("SELECT id, max(c) FROM t1 GROUP BY id", 0, 0)
checkAnswerAndShuffleSorts("SELECT c, count(*) FROM t1 GROUP BY c", 1, 0)
checkAnswerAndShuffleSorts("SELECT c, max(c) FROM t1 GROUP BY c", 1, 2)
checkAnswerAndShuffleSorts("select sum(c) OVER (PARTITION BY id ORDER BY c) from t1", 0, 1)
Contributor:

Why change this test?

Contributor Author (@Zouxxyy):

Because Spark 4.0 no longer supports computing sum directly over strings. This looks like a typo in the original test, so I changed it to max, matching the case above.
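Concretely, using t1 from the test above (where c is a STRING column), the difference looks like this. A small illustration based on the author's explanation, not code from the PR:

// Spark < 4.0 implicitly casts the string column, so SUM "works";
// Spark 4.0 rejects SUM over a STRING column instead.
spark.sql("SELECT sum(c) OVER (PARTITION BY id ORDER BY c) FROM t1") // fails on 4.0
// MAX is defined for strings on all versions, so the test uses it instead.
spark.sql("SELECT max(c) OVER (PARTITION BY id ORDER BY c) FROM t1") // works on 3.x and 4.0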

@YannByron (Contributor):

LGTM. Left two comments. cc @JingsongLi

@@ -34,10 +34,22 @@ under the License.
<name>Paimon : Spark : Common</name>
Contributor:

We should add the Scala version here, because this paimon-spark-common will now produce two artifacts, so we need to distinguish them.

Contributor Author (@Zouxxyy):

Addressed.

@JingsongLi (Contributor) left a review:

+1

@YannByron merged commit a389af6 into apache:master on Nov 11, 2024 (12 checks passed).
@Zouxxyy deleted the dev/spark4.0-1 branch on November 13, 2024.