
[flink] add coordinate and worker operator for small changelog files compaction #4380

Merged
merged 7 commits into apache:master on Oct 31, 2024

Conversation

LsomeYeah (Contributor)

Purpose

Linked issue: close #xxx

Add a coordinator operator to the small changelog file compaction pipeline that decides how to concatenate small changelog files into one or more result files of the target file size, and add a worker operator that merges those small files.
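The coordinator's grouping decision described above can be sketched roughly as follows. This is a minimal illustration, not Paimon's actual implementation: the class, method, and use of raw file sizes are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the coordinator's grouping decision: accumulate small
// changelog files until their total size reaches the target file size, then
// cut a group. Names here are hypothetical, not Paimon's actual classes.
public class ChangelogGroupingSketch {

    /** Splits file sizes into groups whose accumulated size reaches targetSize. */
    public static List<List<Long>> group(List<Long> fileSizes, long targetSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long accumulated = 0;
        for (long size : fileSizes) {
            current.add(size);
            accumulated += size;
            if (accumulated >= targetSize) {
                groups.add(current); // this group has reached the target size
                current = new ArrayList<>();
                accumulated = 0;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current); // leftover small files form a last, smaller group
        }
        return groups;
    }

    public static void main(String[] args) {
        // Three 40-byte files with a 64-byte target: grouped as [40, 40] and [40].
        System.out.println(group(List.of(40L, 40L, 40L), 64L));
    }
}
```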

Tests

API and Format

Documentation

tsreaper and others added 2 commits October 25, 2024 14:09
# Conflicts:
#	paimon-flink/paimon-flink-common/src/test/java/org/apache/paimon/flink/PrimaryKeyFileStoreTableITCase.java
}

private void emitPartitionChangelogCompactTask(BinaryRow partition) {
PartitionChangelog partitionChangelog = partitionChangelogs.get(partition);
Contributor: Could partitionChangelog be null here?
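A minimal illustration of the guard the reviewer is asking about: skip partitions that have no recorded changelog instead of dereferencing null. Types are simplified stand-ins (String in place of BinaryRow and PartitionChangelog), so this is a sketch of the shape of the fix, not the PR's code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the null guard: return early when no changelog was recorded for
// the partition. String stands in for BinaryRow / PartitionChangelog.
public class NullGuardSketch {
    final Map<String, String> partitionChangelogs = new HashMap<>();

    void emitPartitionChangelogCompactTask(String partition) {
        String partitionChangelog = partitionChangelogs.get(partition);
        if (partitionChangelog == null) {
            return; // nothing recorded for this partition, nothing to compact
        }
        // ... build and emit the compact task from partitionChangelog ...
        partitionChangelogs.remove(partition);
    }

    public static void main(String[] args) {
        NullGuardSketch sketch = new NullGuardSketch();
        sketch.emitPartitionChangelogCompactTask("p0"); // unknown partition: no NPE
        sketch.partitionChangelogs.put("p1", "changelog");
        sketch.emitPartitionChangelogCompactTask("p1");
        System.out.println(sketch.partitionChangelogs.isEmpty());
    }
}
```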

private final Map<Integer, List<DataFileMeta>> newFileChangelogFiles;
private final Map<Integer, List<DataFileMeta>> compactChangelogFiles;

public long totalFileSize() {
Contributor: This method is never called; delete it?


private static class PartitionChangelog {
private long totalFileSize;
private final Map<Integer, List<DataFileMeta>> newFileChangelogFiles;
Contributor: rename to newChangelogFiles?

partitionChangelogs.remove(partition);
}

private void emitAllPartitionsChanglogCompactTask() {
Contributor:
partitionChangelogs.keySet().forEach(this::emitPartitionChangelogCompactTask);

}
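One caveat worth noting about iterating the map here: since emitPartitionChangelogCompactTask removes entries from partitionChangelogs, calling forEach directly on the live keySet() would throw ConcurrentModificationException. A sketch with simplified types (String in place of the PR's BinaryRow / PartitionChangelog) showing the copy-the-keys workaround:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Sketch of an emit-all loop that survives mid-iteration removal: the
// per-partition method mutates the map, so iterate over a copy of the keys.
// partitionChangelogs.keySet().forEach(...) would throw
// ConcurrentModificationException because of the remove() below.
public class EmitAllSketch {
    final Map<String, String> partitionChangelogs = new HashMap<>();

    void emitPartitionChangelogCompactTask(String partition) {
        // ... emit the compact task for this partition ...
        partitionChangelogs.remove(partition); // mutates the map mid-iteration
    }

    void emitAllPartitionsChangelogCompactTask() {
        // Copy the keys first so remove() cannot invalidate the iteration.
        new ArrayList<>(partitionChangelogs.keySet())
                .forEach(this::emitPartitionChangelogCompactTask);
    }

    public static void main(String[] args) {
        EmitAllSketch sketch = new EmitAllSketch();
        sketch.partitionChangelogs.put("p0", "c0");
        sketch.partitionChangelogs.put("p1", "c1");
        sketch.emitAllPartitionsChangelogCompactTask();
        System.out.println(sketch.partitionChangelogs.size());
    }
}
```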

private void emitPartitionChangelogCompactTask(BinaryRow partition) {
PartitionChangelog partitionChangelog = partitionChangelogs.get(partition);
Contributor: Could partitionChangelog be null here?

public class ChangelogCompactTask implements Serializable {
private final long checkpointId;
private final BinaryRow partition;
private final Map<Integer, List<DataFileMeta>> newFileChangelogFiles;
Contributor: rename to newChangelogFiles?

public List<Committable> doCompact(FileStoreTable table) throws Exception {
FileStorePathFactory pathFactory = table.store().pathFactory();

// copy all changelog files to a new big file
Contributor: The two for statements have a lot of the same code; you can avoid the duplication.
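One way the two near-identical loops could be factored, sketched with simplified types (String in place of DataFileMeta, a counter in place of the actual byte copy): both loops collapse into a single helper that takes the file map plus an isCompactResult flag.

```java
import java.util.List;
import java.util.Map;

// Sketch of de-duplicating the two copy loops: one helper handles both
// newFileChangelogFiles and compactChangelogFiles, distinguished by a flag.
// String stands in for DataFileMeta; copied just counts the copy calls.
public class CopyLoopsSketch {
    static int copied = 0;

    static void copyFiles(Map<Integer, List<String>> changelogFiles, boolean isCompactResult) {
        for (Map.Entry<Integer, List<String>> entry : changelogFiles.entrySet()) {
            int bucket = entry.getKey();
            for (String meta : entry.getValue()) {
                copyFile(bucket, isCompactResult, meta);
            }
        }
    }

    static void copyFile(int bucket, boolean isCompactResult, String meta) {
        copied++; // stand-in for copying the file's bytes into the big file
    }

    public static void main(String[] args) {
        copyFiles(Map.of(0, List.of("f1", "f2")), false); // new-file changelogs
        copyFiles(Map.of(1, List.of("f3")), true);        // compaction changelogs
        System.out.println(copied);
    }
}
```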


// copy all changelog files to a new big file
for (Map.Entry<Integer, List<DataFileMeta>> entry : newFileChangelogFiles.entrySet()) {
Integer bucket = entry.getKey();
Contributor: use int instead of Integer for bucket.

+ CompactedChangelogReadOnlyFormat.getIdentifier(
baseResult.meta.fileFormat())));

List<Committable> newCommittables = new ArrayList<>();
Contributor: the list can be presized:
List<Committable> newCommittables = new ArrayList<>(bucketedResults.entrySet().size());

import org.apache.flink.api.common.typeutils.TypeSerializer;

/** Type information for {@link ChangelogCompactTask}. */
public class ChangelogTaskTypeInfo extends TypeInformation<ChangelogCompactTask> {
Contributor: add a blank line here.

private void copyFile(
FileStoreTable table, Path path, int bucket, boolean isCompactResult, DataFileMeta meta)
throws Exception {
if (outputStream == null) {
Contributor:
copyFile is only called in doCompact, so outputStream can be a local variable instead of a class member.
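The suggested refactor also lets the stream live inside a try-with-resources block, so it is closed exactly once even on exceptions. A sketch under assumed names (ByteArrayOutputStream standing in for the real file output stream; doCompact and copyFile are simplified reconstructions, not the PR's code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: keep the output stream local to doCompact and pass it into copyFile,
// so no stream state lives on the class and try-with-resources closes it.
// ByteArrayOutputStream is a stand-in for the real file output stream.
public class LocalStreamSketch {

    int doCompact() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream outputStream = buffer) { // closed even on exceptions
            copyFile(outputStream, "changelog-0");
            copyFile(outputStream, "changelog-1");
        }
        return buffer.size();
    }

    private void copyFile(OutputStream outputStream, String file) throws IOException {
        outputStream.write(file.getBytes()); // stand-in for copying file bytes
    }

    public static void main(String[] args) throws IOException {
        System.out.println(new LocalStreamSketch().doCompact());
    }
}
```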

assertThat(compactedChangelogs2).hasSize(2);
assertThat(listAllFilesWithPrefix("changelog-")).isEmpty();

// write update data
Contributor:
nit: extra spaces

+ "'changelog-producer' = 'lookup', "
+ "'lookup-wait' = '%s', "
+ "'deletion-vectors.enabled' = '%s', "
+ "'changelog.compact.parallelism' = '%s'",
Contributor:
What is this table option? Also why do you change write buffer size?

@tsreaper tsreaper merged commit c2c724a into apache:master Oct 31, 2024
12 of 13 checks passed
hang8929201 pushed a commit to hang8929201/paimon that referenced this pull request Nov 7, 2024