[spark] Support auto disable bucketed scan #3928

ulysses-you · 2024-08-09T08:18:24Z

Purpose

This PR adds a new rule, DisableUnnecessaryPaimonBucketedScan, that automatically disables bucketed scan when it is not actually effective, i.e., when no shuffle exchange is removed. This avoids a performance regression, since a bucketed scan may have lower parallelism than a normal scan.

For example: a table has bucket key x, but the user joins, groups by, or partitions by column y.

Note: this rule is inspired by Spark's DisableUnnecessaryBucketedScan, but it works for v2 scans.
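
The decision described above can be illustrated with a toy model. This is only a minimal sketch, not the rule added by this PR: `Plan`, `Scan`, `RequiresClustering`, and `removesShuffle` are hypothetical names invented for illustration, and the real rule works on Spark physical plans rather than on these case classes. The sketch assumes that a bucketed scan can avoid a shuffle only when every bucket key appears among the keys the parent operator needs.

```scala
// Toy plan model (hypothetical, not Spark's or Paimon's classes).
sealed trait Plan

// A table scan that may report bucketed partitioning on `bucketKeys`.
final case class Scan(table: String, bucketKeys: Seq[String], bucketed: Boolean = true) extends Plan

// An operator (join, group-by, window) that needs its input clustered by `keys`.
final case class RequiresClustering(keys: Seq[String], child: Plan) extends Plan

object DisableUnnecessaryBucketedScanSketch {

  // Assumption: the shuffle can be skipped only if all bucket keys
  // are among the keys required by the parent operator.
  def removesShuffle(scan: Scan, requiredKeys: Seq[String]): Boolean =
    scan.bucketed && scan.bucketKeys.nonEmpty && scan.bucketKeys.forall(requiredKeys.contains)

  def apply(plan: Plan): Plan = plan match {
    // Keep the bucketed scan only when it actually removes the shuffle.
    case RequiresClustering(keys, scan: Scan) =>
      RequiresClustering(keys, scan.copy(bucketed = removesShuffle(scan, keys)))
    case RequiresClustering(keys, child) =>
      RequiresClustering(keys, apply(child))
    // Nothing above the scan consumes its partitioning, so fall back to a normal scan.
    case scan: Scan =>
      scan.copy(bucketed = false)
  }
}

// Bucket key x, but the query clusters on y: the bucketed scan is disabled.
val onY = DisableUnnecessaryBucketedScanSketch(RequiresClustering(Seq("y"), Scan("t", Seq("x"))))

// Bucket key x and the query clusters on x and z: the bucketed scan is kept.
val onX = DisableUnnecessaryBucketedScanSketch(RequiresClustering(Seq("x", "z"), Scan("t", Seq("x"))))
```

In this model, `onY` ends up with `bucketed = false` and `onX` keeps `bucketed = true`, which mirrors the example in the description of a table bucketed on x that is queried on y.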

Tests

Add test.

API and Format

no

Documentation

JingsongLi · 2024-08-11T11:42:59Z

It seems the Spark test failed.

ulysses-you · 2024-08-12T01:35:54Z

@JingsongLi thank you for the reminder, it took me a while to find the root cause...

ulysses-you · 2024-08-12T10:13:45Z

@JingsongLi @YannByron do you have time to take a look? Thank you.

JingsongLi

Looks good to me! @ulysses-you

Just left one minor comment.

JingsongLi · 2024-08-16T04:12:44Z

paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/PaimonScan.scala

  extends PaimonBaseScan(table, requiredSchema, filters, reservedFilters, pushDownLimit)
  with SupportsRuntimeFiltering
  with SupportsReportPartitioning {

-  override def outputPartitioning(): Partitioning = {
+  def withDisabledBucketedScan(): PaimonScan = {


disableBucketedScan

ulysses-you · Aug 16, 2024

Good naming! But it seems to conflict with line 40.

JingsongLi · Aug 16, 2024

bucketedScanDisabled and disableBucketedScan?

ulysses-you · Aug 16, 2024

+1, addressed.

JingsongLi · 2024-08-16T06:25:06Z

+1 Thanks @ulysses-you for the contribution. Merging...

ulysses-you force-pushed the bucketed branch from b3d2521 to 7ec68cf on August 9, 2024 09:48

YannByron self-requested a review August 13, 2024 01:57

JingsongLi reviewed Aug 16, 2024

ulysses-you added 4 commits August 16, 2024 13:11

Support auto disable bucketed scan

2dac53d

fix

503ae27

Fix

d833dd5

address comment

5c0db15

ulysses-you force-pushed the bucketed branch from e85583e to 5c0db15 on August 16, 2024 05:11

JingsongLi merged commit a7e7bf6 into apache:master Aug 16, 2024
10 checks passed

ulysses-you deleted the bucketed branch August 16, 2024 06:37
