Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[flink] Introduce Range Partition And Sort in Append Scalable Table Batch Writing for Flink #3384

Merged
merged 3 commits into from
May 31, 2024

Conversation

WencongLiu
Copy link
Contributor

@WencongLiu WencongLiu commented May 22, 2024

Purpose

Linked issue: close #3385
Introduce Range Partition And Sort in Append Scalable Table Batch Writing for Flink.

Tests

  1. Add TableSortInfoTest for the TableSortInfo.
  2. Add RangePartitionAndSortForUnawareBucketTableITCase.

API and Format

no

Documentation

Yes. The doc of FlinkConnectorOptions has been updated.

@WencongLiu WencongLiu force-pushed the dev_0522_1048 branch 2 times, most recently from a1d3c50 to 642b444 Compare May 23, 2024 02:35
Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @WencongLiu , left comments.

int sinkParallelism,
int localSampleSize,
int globalSampleSize) {
checkArgument(!sortColumns.isEmpty(), "Sort columns cannot be empty");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to check here, this is a private constructor.
You can check in the Builder.build.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, placing all the check statements in the build is already sufficient. The changes have been made as requested.

public Builder setSinkParallelism(int sinkParallelism) {
checkArgument(
sinkParallelism > 0,
"The sink parallelism must be specified when sorting the table data. Please set it using the key: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use checkArgument(boolean expression, String errorMessageTemplate, @CheckForNull Object... errorMessageArgs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

public Builder setRangeNumber(int rangeNumber) {
checkArgument(rangeNumber > 0, "Range number must be positive");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move all check to build.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// 4. Remove the range index or both range index and key. (shuffle according range
// partition)
if (outputWithKey) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to change here.
You can just output with key, and introduce a RemoveKeyOperator.

It should have no impact to performance if it is chaining to upstream operator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outputWithKey parameter is indeed unnecessary, and the changes have been made as requested.

if (boundedInput == null) {
boundedInput = !FlinkSink.isStreaming(input);
}
checkState(boundedInput, "The clustering should be executed under batch mode.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should discuss semantics, maybe we can just not sort in streaming mode?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. +1 for not failing the job. We can enable the feature only when all the conditions are matched. Otherwise, if the by-columns are configured, we can print a warning log about it.

  2. We probably should also mention the limitations (batch mode, table type, etc.) in the configuration description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JingsongLi
Good point, ignoring clustering in stream mode is a sensible design. This avoids the need for users to manually adjust the table configuration under streaming mode.

@xintongsong

  1. I've removed the check and added releated warning log.
  2. I've added the limitations of table type and batch mode in configuration description. I've also added a commit to introduce the clustering feature in paimon docs.

@@ -119,8 +133,86 @@ public FlinkSinkBuilder inputBounded(boolean bounded) {
return this;
}

/** Set the table sort info. */
public FlinkSinkBuilder setTableSortInfo(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusteringIfPossible?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -119,8 +133,86 @@ public FlinkSinkBuilder inputBounded(boolean bounded) {
return this;
}

/** Set the table sort info. */
public FlinkSinkBuilder setTableSortInfo(
String sortColumnsString,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

respect option keys, sortColumnsString => clusterColumns

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


private final List<String> sortColumns;

private final String sortStrategy;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to use OrderType for this field. This would allow us to check the field early, in Builder#setSortStrategy.

if (boundedInput == null) {
boundedInput = !FlinkSink.isStreaming(input);
}
checkState(boundedInput, "The clustering should be executed under batch mode.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. +1 for not failing the job. We can enable the feature only when all the conditions are matched. Otherwise, if the by-columns are configured, we can print a warning log about it.

  2. We probably should also mention the limitations (batch mode, table type, etc.) in the configuration description.

@@ -119,8 +133,86 @@ public FlinkSinkBuilder inputBounded(boolean bounded) {
return this;
}

/** Set the table sort info. */
public FlinkSinkBuilder setTableSortInfo(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@WencongLiu WencongLiu force-pushed the dev_0522_1048 branch 3 times, most recently from 582e2ab to 860bc0f Compare May 28, 2024 06:40
Copy link
Contributor

@xintongsong xintongsong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WencongLiu
Thanks for addressing my comments. LGTM.

Copy link
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi
Copy link
Contributor

CC @leaves12138 to final check.

@JingsongLi JingsongLi requested a review from leaves12138 May 30, 2024 15:13
Copy link
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@leaves12138 leaves12138 merged commit 885c2e0 into apache:master May 31, 2024
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]Introduce Range Partition And Sort in Append Scalable Table Batch Writing for Flink
4 participants