-
Notifications
You must be signed in to change notification settings - Fork 997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[spark] Support auto disable bucketed scan #3928
Conversation
It seems spark test failed. |
@JingsongLi thank you for the reminder, it took me a while to find the root cause... |
@JingsongLi @YannByron do you have to take a look ? thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! @ulysses-you
Just left one minor comment.
extends PaimonBaseScan(table, requiredSchema, filters, reservedFilters, pushDownLimit) | ||
with SupportsRuntimeFiltering | ||
with SupportsReportPartitioning { | ||
|
||
override def outputPartitioning(): Partitioning = { | ||
def withDisabledBucketedScan(): PaimonScan = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disableBucketedScan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good naming! but it seems conflict with line40..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bucketedScanDisabled and disableBucketedScan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, addressed
+1 Thanks @ulysses-you for the contribution. Merging... |
Purpose
This pr adds a new rule
DisableUnnecessaryPaimonBucketedScan
to support auto disable bucketed scan if the bucket scan is not actually effective i.e., there is no shuffle exchange been removed. This change is to avoid performance regression since the bucketed scan may have smaller parallelism than normal scan.For example: a table with bucket key x but user join/group-by/partition-by on column y.
Note, this rule is inspired from Spark
DisableUnnecessaryBucketedScan
but work for v2 scan.Tests
Add test.
API and Format
no
Documentation