
[core] Filter data file while executing local orphan clean #4287

Merged
merged 4 commits into apache:master on Oct 9, 2024

Conversation

@bknbkn bknbkn commented Oct 8, 2024

Purpose

Local orphan files clean always ran into OOM while collecting data file meta, because the number of data files in a table can be very large (tens of millions or more).

Relatively speaking, the number of candidate files is small (tens of thousands to hundreds of thousands).

So we can collect data file meta only for the candidate files, which reduces the probability of OOM.
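The idea above can be sketched as follows. This is a minimal illustration with hypothetical names (not the actual Paimon API): instead of materializing meta for every data file in the table, only entries whose file name appears in the orphan-candidate set are kept, shrinking the in-memory working set.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class CandidateFilterSketch {

    // Keep only the data-file meta entries whose name is in the candidate set,
    // instead of collecting meta for every data file in the table.
    static Map<String, String> collectCandidateMeta(
            Map<String, String> allDataFileMeta, Set<String> candidateNames) {
        return allDataFileMeta.entrySet().stream()
                .filter(e -> candidateNames.contains(e.getKey()))
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, String> all = new HashMap<>();
        all.put("f1", "meta1");
        all.put("f2", "meta2");
        all.put("f3", "meta3");
        Map<String, String> filtered =
                collectCandidateMeta(all, new HashSet<>(Arrays.asList("f2", "f3")));
        System.out.println(filtered.size());
    }
}
```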

Tests

API and Format

Documentation

@bknbkn bknbkn closed this Oct 8, 2024
@bknbkn bknbkn reopened this Oct 8, 2024
candidateDeletes.removeAll(usedFiles);
candidateDeletes.stream().map(candidates::get).forEach(fileCleaner);
deleteFiles.addAll(
candidateDeletes.stream().map(candidates::get).collect(Collectors.toList()));

candidateDeletes.clear();


fixed
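The exchange above can be sketched end to end. This is an illustrative stand-alone version (the types and `fileCleaner` stand-in are hypothetical, not the actual Paimon code): remove used files from the candidate set, delete what remains, record the deleted paths, then clear the set as the reviewer suggested so its references can be reclaimed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;
import java.util.stream.Collectors;

public class CleanupSketch {
    public static void main(String[] args) {
        Map<String, String> candidates = new HashMap<>();
        candidates.put("a", "/tmp/a");
        candidates.put("b", "/tmp/b");
        Set<String> candidateDeletes = new HashSet<>(candidates.keySet());
        Set<String> usedFiles = new HashSet<>(Collections.singleton("a"));
        List<String> deleteFiles = new ArrayList<>();
        Consumer<String> fileCleaner = path -> {}; // stand-in for actual file deletion

        // Anything still referenced by the table must not be deleted.
        candidateDeletes.removeAll(usedFiles);
        candidateDeletes.stream().map(candidates::get).forEach(fileCleaner);
        deleteFiles.addAll(
                candidateDeletes.stream().map(candidates::get).collect(Collectors.toList()));
        candidateDeletes.clear(); // reviewer-suggested: release references once done

        System.out.println(deleteFiles);
    }
}
```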

@wwj6591812

I think this is a good optimization, but I have two questions:
1. In your production environment, did you encounter OOM here? Did you confirm it by dumping and analyzing the heap?
2. Since you have done this for LocalOrphanFilesClean, can you also add this optimization to the distributed OrphanFilesClean?

@bknbkn

bknbkn commented Oct 9, 2024

I'm not sure whether this optimization can be applied in the distributed case, because it would require broadcasting the candidate set to each executor, which brings additional overhead. In that setting, the number of data file metas held by each executor is not large, so there is no OOM risk unless there is particularly severe data skew.

On the other hand, the process of joining the candidate set with the used-file set already implicitly includes a broadcast-and-filter step.

Therefore, the benefit of this optimization in the distributed case needs further analysis.

The analysis from a production environment run with LocalOrphanFilesClean will be provided later.

@bknbkn bknbkn closed this Oct 9, 2024
@bknbkn bknbkn reopened this Oct 9, 2024
@bknbkn

bknbkn commented Oct 9, 2024

Tested on a 7 TB table in local mode:

Before this patch, it OOMs after collecting 80 million data file metas (screenshots attached).

After the optimization, it collects only about 2,000 manifests and about 60,000 data file metas, and the procedure takes about 20 minutes on a local machine (screenshots attached).

@@ -114,7 +118,7 @@ private List<String> getUsedFiles(String branch) {
        ManifestFile manifestFile =
                table.switchToBranch(branch).store().manifestFileFactory().create();
        try {
-            List<String> manifests = new ArrayList<>();
+            Set<String> manifests = new HashSet<>();

While reading multiple snapshots, there will be duplicate manifests; deduplicating them improves efficiency.
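The effect of the `List` to `Set` change can be illustrated with a small sketch (hypothetical snapshot data, not the actual Paimon structures): several snapshots reference overlapping manifests, and accumulating their names in a `HashSet` drops duplicates automatically, so each manifest is read only once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ManifestDedupSketch {
    public static void main(String[] args) {
        // Three snapshots referencing overlapping manifest files.
        List<List<String>> snapshots = Arrays.asList(
                Arrays.asList("m1", "m2"),
                Arrays.asList("m2", "m3"),
                Arrays.asList("m3", "m4"));

        Set<String> manifests = new HashSet<>();
        for (List<String> snapshot : snapshots) {
            manifests.addAll(snapshot); // duplicates are ignored by the Set
        }

        // 4 distinct manifests instead of 6 list entries to process.
        System.out.println(manifests.size());
    }
}
```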

@bknbkn

bknbkn commented Oct 9, 2024

Could you review it again? Thanks @wwj6591812

@JingsongLi JingsongLi left a comment


Good point! Thanks @bknbkn

@JingsongLi JingsongLi merged commit 32d0191 into apache:master Oct 9, 2024
10 checks passed