
[spark] Support report partitioning to eliminate shuffle exchange #3912

Merged: 1 commit merged into apache:master on Aug 8, 2024

Conversation

@ulysses-you (Contributor) commented Aug 6, 2024

Purpose

This PR makes PaimonScan implement SupportsReportPartitioning for the Spark engine, so that the shuffle exchange can be eliminated when doing joins, aggregates, windows, etc.

For example, the following join query does not introduce a shuffle exchange:

CREATE TABLE t1 (
    id BIGINT,
    c1 BIGINT,
    c2 STRING
) USING paimon
TBLPROPERTIES (
    'primary-key' = 'id',
    'bucket' = '10'
);

CREATE TABLE t2 (
    id BIGINT,
    c1 BIGINT,
    c2 STRING
) USING paimon
TBLPROPERTIES (
    'primary-key' = 'id',
    'bucket' = '10'
);

set spark.sql.autoBroadcastJoinThreshold=-1;
set spark.sql.sources.v2.bucketing.enabled=true;
SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;

This feature depends on Spark's storage-partitioned join, in particular the KeyGroupedPartitioning interface introduced in Spark 3.3.

To achieve this, the PR introduces BucketSpec for Paimon to hold the bucket-related information:

public class BucketSpec {
    private BucketMode bucketMode;
    private List<String> bucketKeys;
    private int numBuckets;
    // constructor and getters omitted
}

We report the partitioning only if the bucket mode is HASH_FIXED. For now, this supports primary-key tables and bucketed tables that use a single bucket column.
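Conceptually, the scan-side reporting looks like the following sketch (illustrative only, not the exact code of this PR; the parameter names such as isHashFixedBucket and numInputPartitions are stand-ins for the real BucketSpec accessors and split count):

// Report a key-grouped partitioning only for fixed hash bucketing with a single
// bucket key; anything else falls back to UnknownPartitioning.
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.read.partitioning.{KeyGroupedPartitioning, Partitioning, UnknownPartitioning}

object ReportPartitioningSketch {
  def outputPartitioning(
      isHashFixedBucket: Boolean, // in this PR: bucket mode == HASH_FIXED
      bucketKeys: Seq[String],
      numBuckets: Int,
      numInputPartitions: Int): Partitioning = {
    if (isHashFixedBucket && bucketKeys.size == 1) {
      // bucket(numBuckets, key) is the transform both join sides must agree on
      val key: Expression = Expressions.bucket(numBuckets, bucketKeys.head)
      new KeyGroupedPartitioning(Array(key), numInputPartitions)
    } else {
      new UnknownPartitioning(numInputPartitions)
    }
  }
}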

It also introduces FunctionCatalog support in SparkBaseCatalog to resolve the bucket transform expression.

Tests

Added a new test suite, BucketedTableQueryTest.

API and Format

no

Documentation

@ulysses-you (Contributor, Author)

cc @JingsongLi @YannByron, do you have time to take a look? Thank you.

@@ -66,6 +67,10 @@ public class TableSchema implements Serializable {

private final List<String> primaryKeys;

private List<String> bucketKeys;
Contributor

This field can be final.

@JingsongLi (Contributor)

Wow, this is a bucketed join; this optimization is on my list.
Thanks for the contribution, @ulysses-you!

assert(bucketSpec.getBucketKeys.size() == 1)
Expressions.bucket(bucketSpec.getNumBucket, bucketSpec.getBucketKeys.get(0))
val key = Expressions.bucket(bucketSpec.getNumBucket, bucketSpec.getBucketKeys.get(0))
new KeyGroupedPartitioning(Array(key), lazyInputPartitions.size)
Contributor

If it is an equi-join on two keys, can it not be supported?

Contributor

Ignore me, I see Spark does not support bucket transforms with several input attributes.

Contributor Author

Yeah, it seems to be an issue in the Spark community. I did not find a strong reason why Spark forbids it.

* bucket(10, col)` would fail since we do not implement {@link
* org.apache.spark.sql.connector.catalog.functions.ScalarFunction}
*/
public static class BucketFunction implements UnboundFunction {
Contributor

Because Paimon's bucket calculation and Spark's bucket function are completely different implementations.

So this function is an UnboundFunction? I'm not sure if my understanding is correct.

Contributor Author

UnboundFunction is kind of an unresolved expression in Spark and is eventually resolved to a BoundFunction; see the UnboundFunction#bind method.

For Paimon, I think it is more like a placeholder, since it is not used for evaluation. It is only used to check whether two partitionings are semantically equivalent.
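To make that concrete, here is a minimal sketch of such a placeholder (my own illustration, not the code in this PR): it can be bound, so Spark can resolve and compare the bucket transform, but it implements neither ScalarFunction nor AggregateFunction, so it can never be evaluated.

import org.apache.spark.sql.connector.catalog.functions.{BoundFunction, UnboundFunction}
import org.apache.spark.sql.types.{DataType, IntegerType, StructType}

// Placeholder bucket function: resolvable and comparable, but not evaluable.
class BucketPlaceholder(numBuckets: Int) extends UnboundFunction {
  override def name(): String = "bucket"
  override def description(): String = s"bucket($numBuckets, col)"

  override def bind(inputType: StructType): BoundFunction = new BoundFunction {
    override def inputTypes(): Array[DataType] = inputType.fields.map(_.dataType)
    override def resultType(): DataType = IntegerType
    override def name(): String = "bucket"
    // The canonical name is what Spark compares when deciding whether two
    // reported partitionings are semantically equivalent.
    override def canonicalName(): String = s"paimon.bucket($numBuckets)"
  }
}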

Contributor Author

Another thing that may be related: it seems Paimon does not use the Spark DSv2 write feature, e.g. RequiresDistributionAndOrdering, so the bucket function is only used to report partitioning.

Contributor

For using RequiresDistributionAndOrdering:
The biggest problem before was that "Paimon's bucket calculation method and Spark's bucket function are completely different implementations".
Implementing org.apache.spark.sql.connector.catalog.functions.ScalarFunction looks like it could solve this problem.
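For reference, a rough sketch of that ScalarFunction shape (purely hypothetical; the hash below is a placeholder, and a real implementation would have to reproduce Paimon's own bucket calculation exactly):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType, LongType}

// Hypothetical evaluable bucket function: with this, Spark could also compute
// bucket ids on the write path via RequiresDistributionAndOrdering.
class EvaluableBucket(numBuckets: Int) extends ScalarFunction[Integer] {
  override def inputTypes(): Array[DataType] = Array(LongType)
  override def resultType(): DataType = IntegerType
  override def name(): String = "bucket"
  override def canonicalName(): String = s"paimon.bucket($numBuckets)"

  override def produceResult(input: InternalRow): Integer = {
    // Placeholder hash only; read and write sides must agree on the real one.
    val key = input.getLong(0)
    Integer.valueOf(java.lang.Math.floorMod(key, numBuckets.toLong).toInt)
  }
}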

Contributor Author

Thank you @JingsongLi for the context. I will work on it in a follow-up if I find time.

@YannByron self-assigned this on Aug 7, 2024
@YannByron (Contributor)

> This feature depends on Spark storage-partitioned join, in particular the interface KeyGroupedPartitioning which is from Spark 3.3. For ease of review, this PR only supports it in Spark 3.5.

Thanks for this PR, @ulysses-you. Can we modify this PR to make it available for Spark 3.3+ in the first shot?

@ulysses-you (Contributor, Author)

@YannByron I made changes to support 3.3, 3.4 and 3.5, thank you.

@YannByron (Contributor)

> @YannByron I made changes to support 3.3, 3.4 and 3.5, thank you.

Hi @ulysses-you, Paimon differs from Iceberg in how it supports multiple Spark versions: Iceberg does this by copying the code, while Paimon does it by extracting the common code and resolving the compatibility differences. Maybe @JingsongLi can explain more.

@JingsongLi (Contributor)

> @YannByron I made changes to support 3.3, 3.4 and 3.5, thank you.

> Hi @ulysses-you, Paimon differs from Iceberg in how it supports multiple Spark versions: Iceberg does this by copying the code, while Paimon does it by extracting the common code and resolving the compatibility differences. Maybe @JingsongLi can explain more.

This is not easy to explain; I will try to clarify the trade-off here.

The approach is to make minimal changes and avoid copying a large amount of code, even if that means copying a few classes into specific versions to work around version incompatibilities.

From my personal perspective, having a lot of code redundancy is not right, and it is not worth abstracting a lot for the sake of compatibility of a small part.

@ulysses-you (Contributor, Author)

Thank you @JingsongLi, got it.

@JingsongLi (Contributor) left a comment

Thanks @ulysses-you for the update! Looks good to me!

And thanks @YannByron for the review.

@JingsongLi merged commit 3b9dd9b into apache:master on Aug 8, 2024
10 checks passed
@ulysses-you deleted the partitioning branch on August 9, 2024 00:57
@ulysses-you (Contributor, Author)

Just to help users with troubleshooting: the Spark community has a bug in bucketed scans when there are multiple inner joins:

Cause: java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:264)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.createKeyGroupedShuffleSpec(EnsureRequirements.scala:642)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$checkKeyGroupCompatible$1(EnsureRequirements.scala:385)
at scala.collection.immutable.List.map(List.scala:247)
at scala.collection.immutable.List.map(List.scala:79)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:382)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.checkKeyGroupCompatible(EnsureRequirements.scala:364)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:166)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:714)
at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:689)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:528)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:528)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:497)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:689)
at org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:51)
at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$2(AdaptiveSparkPlanExec.scala:882)

See https://issues.apache.org/jira/browse/SPARK-49179. It has been fixed in 3.4.4, 3.5.3, and 4.0.0.
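If you hit this on an affected version and cannot upgrade yet, a possible workaround (an assumption on my part, not something stated in this thread) is to set spark.sql.sources.v2.bucketing.enabled back to false; that should avoid the failing code path, at the cost of giving up the shuffle elimination.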

@YannByron (Contributor)

Link to #2404.
