[core] Filter data file while executing local orphan clean #4287
Conversation
candidateDeletes.removeAll(usedFiles);
candidateDeletes.stream().map(candidates::get).forEach(fileCleaner);
deleteFiles.addAll(
        candidateDeletes.stream().map(candidates::get).collect(Collectors.toList()));
candidateDeletes.clear();
fixed
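For readers following along, here is a minimal self-contained sketch of the resulting cleanup step. Variable names mirror the diff, but the sample data and the fileCleaner below are made up for illustration; the suggested clear() releases the working set once the deletions have been recorded.

    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Consumer;
    import java.util.stream.Collectors;

    public class CandidateCleanupSketch {

        public static void main(String[] args) {
            // Hypothetical data: candidate orphan files and files still referenced by the table.
            Map<String, Path> candidates = new HashMap<>();
            candidates.put("data-1.orc", Paths.get("/warehouse/t/data-1.orc"));
            candidates.put("data-2.orc", Paths.get("/warehouse/t/data-2.orc"));
            Set<String> usedFiles = new HashSet<>();
            usedFiles.add("data-1.orc");

            Set<String> candidateDeletes = new HashSet<>(candidates.keySet());
            List<Path> deleteFiles = new ArrayList<>();
            Consumer<Path> fileCleaner = path -> System.out.println("deleting " + path);

            // Drop every candidate that is still referenced, delete the rest,
            // remember what was deleted, then release the working set.
            candidateDeletes.removeAll(usedFiles);
            candidateDeletes.stream().map(candidates::get).forEach(fileCleaner);
            deleteFiles.addAll(
                    candidateDeletes.stream().map(candidates::get).collect(Collectors.toList()));
            candidateDeletes.clear();

            System.out.println("recorded deletions: " + deleteFiles);
        }
    }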
I think this is a good optimization, but I have two questions.
I'm not sure this optimization applies in the distributed case, because it would require broadcasting the candidate set to every executor, which adds extra overhead. In that setting the number of data file metas held by each executor is not large, so there is no real OOM risk unless the data is severely skewed. On the other hand, joining the candidate set with the used-file set already implicitly performs the broadcast and filtering, so the benefit of this optimization needs further analysis. An analysis from a production environment running LocalOrphanClean will be provided later.
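To make the "implicit broadcast and filter" point above concrete, here is a small in-memory sketch (plain Java, not Paimon or Spark code; the partitioning into "executors" is simulated): each partition of used files only marks the candidates it still references, and the survivors are the orphans.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class BroadcastFilterSketch {

        public static void main(String[] args) {
            // Candidate orphan files found on the file system (hypothetical names).
            Set<String> candidates = new HashSet<>(Arrays.asList("f1", "f2", "f3", "f4"));

            // Used files as they would be scanned by two executors.
            List<Set<String>> usedFilesPerExecutor =
                    Arrays.asList(
                            new HashSet<>(Arrays.asList("f1")),
                            new HashSet<>(Arrays.asList("f3", "f5")));

            // "Broadcast" the candidate set, then let every executor report which
            // candidates it still uses; this mirrors what a distributed anti-join does.
            Set<String> stillUsed = ConcurrentHashMap.newKeySet();
            usedFilesPerExecutor
                    .parallelStream()
                    .forEach(used -> used.stream().filter(candidates::contains).forEach(stillUsed::add));

            candidates.removeAll(stillUsed);
            System.out.println("orphans to delete: " + candidates); // f2 and f4 remain
        }
    }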
@@ -114,7 +118,7 @@ private List<String> getUsedFiles(String branch) {
         ManifestFile manifestFile =
                 table.switchToBranch(branch).store().manifestFileFactory().create();
         try {
-            List<String> manifests = new ArrayList<>();
+            Set<String> manifests = new HashSet<>();
When multiple snapshots are read, the same manifests appear repeatedly, so deduplicating them avoids redundant reads and improves efficiency.
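A minimal sketch of that idea (the manifest names are invented): collecting manifest file names from all readable snapshots into a Set instead of a List means each shared manifest is handled only once.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ManifestDedupSketch {

        public static void main(String[] args) {
            // Hypothetical manifest lists referenced by two snapshots; snapshots of the
            // same table usually share most of their manifests.
            List<String> snapshot1Manifests = Arrays.asList("manifest-0", "manifest-1");
            List<String> snapshot2Manifests = Arrays.asList("manifest-1", "manifest-2");

            // Collecting into a Set instead of a List drops the duplicates, so each
            // manifest is processed only once when the used files are resolved later.
            Set<String> manifests = new HashSet<>();
            manifests.addAll(snapshot1Manifests);
            manifests.addAll(snapshot2Manifests);

            System.out.println(manifests.size() + " unique manifests to read: " + manifests);
        }
    }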
Could you review it again? thanks @wwj6591812
Good point! Thanks @bknbkn
Purpose
Local Remove Orphan Files often hits OOM while collecting data file meta, because the number of data files in a table can be very large (tens of millions or more).
By comparison, the number of candidate files is small (tens of thousands to hundreds of thousands).
So we can collect data file meta only for the candidate files, which reduces the probability of OOM.
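A rough sketch of the approach (class and variable names below are illustrative stand-ins, not the actual Paimon API): only data file metas whose file names appear in the candidate set are kept, so memory usage scales with the candidate set rather than with the whole table.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class CandidateFilteredMetaSketch {

        /** Stand-in for a data file meta entry; the real class carries much more state. */
        static class DataFileMeta {
            final String fileName;

            DataFileMeta(String fileName) {
                this.fileName = fileName;
            }
        }

        public static void main(String[] args) {
            // Candidate orphan file names found on the file system (small set).
            Set<String> candidateFileNames = new HashSet<>(Arrays.asList("data-7.orc", "data-9.orc"));

            // Data file metas read from manifests (in a real table this can be tens of millions).
            List<DataFileMeta> allMetas =
                    Arrays.asList(
                            new DataFileMeta("data-7.orc"),
                            new DataFileMeta("data-8.orc"),
                            new DataFileMeta("data-9.orc"));

            // Keep only the metas whose file name is a candidate, so the in-memory
            // used-file collection stays proportional to the candidate set.
            List<DataFileMeta> relevantMetas =
                    allMetas.stream()
                            .filter(meta -> candidateFileNames.contains(meta.fileName))
                            .collect(Collectors.toList());

            System.out.println("metas kept for orphan check: " + relevantMetas.size());
        }
    }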
Tests
API and Format
Documentation