[core] Filter data file while executing local orphan clean #4287
Conversation
candidateDeletes.removeAll(usedFiles);
candidateDeletes.stream().map(candidates::get).forEach(fileCleaner);
deleteFiles.addAll(
        candidateDeletes.stream().map(candidates::get).collect(Collectors.toList()));
candidateDeletes.clear();
fixed
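For readers following along, here is a minimal self-contained sketch of the resulting cleanup step. Variable names mirror the diff, but the sample data and the fileCleaner below are made up for illustration; the suggested clear() releases the working set once the deletions have been recorded.

    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Consumer;
    import java.util.stream.Collectors;

    public class CandidateCleanupSketch {

        public static void main(String[] args) {
            // Hypothetical data: candidate orphan files and files still referenced by the table.
            Map<String, Path> candidates = new HashMap<>();
            candidates.put("data-1.orc", Paths.get("/warehouse/t/data-1.orc"));
            candidates.put("data-2.orc", Paths.get("/warehouse/t/data-2.orc"));
            Set<String> usedFiles = new HashSet<>();
            usedFiles.add("data-1.orc");

            Set<String> candidateDeletes = new HashSet<>(candidates.keySet());
            List<Path> deleteFiles = new ArrayList<>();
            Consumer<Path> fileCleaner = path -> System.out.println("deleting " + path);

            // Drop every candidate that is still referenced, delete the rest,
            // remember what was deleted, then release the working set.
            candidateDeletes.removeAll(usedFiles);
            candidateDeletes.stream().map(candidates::get).forEach(fileCleaner);
            deleteFiles.addAll(
                    candidateDeletes.stream().map(candidates::get).collect(Collectors.toList()));
            candidateDeletes.clear();

            System.out.println("recorded deletions: " + deleteFiles);
        }
    }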
I think this is a good optimization, but I have two questions.
I'm not sure this optimization applies in the distributed case, because it would require broadcasting the candidate set to every executor, which adds extra overhead. In that setting the number of data file metas held by each executor is not large, so there is no real OOM risk unless the data is severely skewed. On the other hand, joining the candidate set with the used-file set already implicitly performs the broadcast and filtering, so the benefit of this optimization needs further analysis. An analysis from a production environment running LocalOrphanClean will be provided later.
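To make the "implicit broadcast and filter" point above concrete, here is a small in-memory sketch (plain Java, not Paimon or Spark code; the partitioning into "executors" is simulated): each partition of used files only marks the candidates it still references, and the survivors are the orphans.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class BroadcastFilterSketch {

        public static void main(String[] args) {
            // Candidate orphan files found on the file system (hypothetical names).
            Set<String> candidates = new HashSet<>(Arrays.asList("f1", "f2", "f3", "f4"));

            // Used files as they would be scanned by two executors.
            List<Set<String>> usedFilesPerExecutor =
                    Arrays.asList(
                            new HashSet<>(Arrays.asList("f1")),
                            new HashSet<>(Arrays.asList("f3", "f5")));

            // "Broadcast" the candidate set, then let every executor report which
            // candidates it still uses; this mirrors what a distributed anti-join does.
            Set<String> stillUsed = ConcurrentHashMap.newKeySet();
            usedFilesPerExecutor
                    .parallelStream()
                    .forEach(used -> used.stream().filter(candidates::contains).forEach(stillUsed::add));

            candidates.removeAll(stillUsed);
            System.out.println("orphans to delete: " + candidates); // f2 and f4 remain
        }
    }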
@@ -114,7 +118,7 @@ private List<String> getUsedFiles(String branch) {
         ManifestFile manifestFile =
                 table.switchToBranch(branch).store().manifestFileFactory().create();
         try {
-            List<String> manifests = new ArrayList<>();
+            Set<String> manifests = new HashSet<>();
When multiple snapshots are read, the same manifests appear repeatedly, so deduplicating them avoids redundant reads and improves efficiency.
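A minimal sketch of that idea (the manifest names are invented): collecting manifest file names from all readable snapshots into a Set instead of a List means each shared manifest is handled only once.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ManifestDedupSketch {

        public static void main(String[] args) {
            // Hypothetical manifest lists referenced by two snapshots; snapshots of the
            // same table usually share most of their manifests.
            List<String> snapshot1Manifests = Arrays.asList("manifest-0", "manifest-1");
            List<String> snapshot2Manifests = Arrays.asList("manifest-1", "manifest-2");

            // Collecting into a Set instead of a List drops the duplicates, so each
            // manifest is processed only once when the used files are resolved later.
            Set<String> manifests = new HashSet<>();
            manifests.addAll(snapshot1Manifests);
            manifests.addAll(snapshot2Manifests);

            System.out.println(manifests.size() + " unique manifests to read: " + manifests);
        }
    }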
Could you review it again? thanks @wwj6591812
Good point! Thanks @bknbkn
Purpose
Local Remove Orphan Files often hits OOM while collecting data file meta, because the number of data files in a table can be very large (tens of millions or more).
By comparison, the number of candidate files is small (tens of thousands to hundreds of thousands).
So we can collect data file meta only for the candidate files, which reduces the probability of OOM.
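A rough sketch of the approach (class and variable names below are illustrative stand-ins, not the actual Paimon API): only data file metas whose file names appear in the candidate set are kept, so memory usage scales with the candidate set rather than with the whole table.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class CandidateFilteredMetaSketch {

        /** Stand-in for a data file meta entry; the real class carries much more state. */
        static class DataFileMeta {
            final String fileName;

            DataFileMeta(String fileName) {
                this.fileName = fileName;
            }
        }

        public static void main(String[] args) {
            // Candidate orphan file names found on the file system (small set).
            Set<String> candidateFileNames = new HashSet<>(Arrays.asList("data-7.orc", "data-9.orc"));

            // Data file metas read from manifests (in a real table this can be tens of millions).
            List<DataFileMeta> allMetas =
                    Arrays.asList(
                            new DataFileMeta("data-7.orc"),
                            new DataFileMeta("data-8.orc"),
                            new DataFileMeta("data-9.orc"));

            // Keep only the metas whose file name is a candidate, so the in-memory
            // used-file collection stays proportional to the candidate set.
            List<DataFileMeta> relevantMetas =
                    allMetas.stream()
                            .filter(meta -> candidateFileNames.contains(meta.fileName))
                            .collect(Collectors.toList());

            System.out.println("metas kept for orphan check: " + relevantMetas.size());
        }
    }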
Tests
API and Format
Documentation