[core] Support parallel reading of local orphan clean #4320
Conversation
});
}
usedFiles.addAll(dataFiles);
randomlyOnlyExecute(
I think if the manifest size is very big, using randomlyOnlyExecute may cause an OOM.
Do you think so?
I think there is indeed a risk of OOM.
But it can be mitigated by reducing the parallelism (when it is 1, the behavior is similar to the previous serial execution); this change only adds the ability to read in parallel.
If parallelism 1 still OOMs, then distributed mode should be considered.
What about changing randomlyOnlyExecute to sequentialBatchedExecute?
Yeah, I replaced it with sequentialBatchedExecute since it has better control over memory.
In the end, the results still need to be put into a set, so as long as there is no inflation during the reading process, there should be no risk of OOM here.
I tested sequentialBatchedExecute, randomlyOnlyExecute, and randomlyExecute with parallelism 16 on a table with 2 PB of data, and none of them hit an OOM. Besides, sequentialBatchedExecute took about 30 min for the procedure, while randomlyOnlyExecute and randomlyExecute took about 20 min.
So I prefer randomlyOnlyExecute: this method is more concise, introduces no extra variables, and also reduces the GC cost of the temporary data-file variables. Its disadvantage is the lock-wait overhead while inserting into usedFiles, but that has no impact on the overall time, which mainly comes from IO.
What do you think? @JingsongLi @wwj6591812
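Not the PR's code, just a minimal sketch of the trade-off discussed above using only JDK classes (the actual patch relies on Paimon's ThreadPoolUtils helpers, whose exact semantics may differ): variant A writes every task's result straight into one shared concurrent set, while variant B lets each task return its own list and merges the lists on the calling thread. With IO-bound manifest reads, the contention in variant A is negligible, which matches the timings above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class UsedFilesCollectionSketch {

    // Variant A ("randomlyOnlyExecute"-style): every task adds directly into one
    // shared thread-safe set. Concise, no temporary per-task lists to GC, but
    // inserts contend on the shared set.
    static Set<String> collectShared(
            List<String> manifests, Function<String, List<String>> readDataFiles)
            throws InterruptedException {
        Set<String> usedFiles = ConcurrentHashMap.newKeySet();
        ExecutorService executor = Executors.newFixedThreadPool(16);
        for (String manifest : manifests) {
            executor.execute(() -> usedFiles.addAll(readDataFiles.apply(manifest)));
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        return usedFiles;
    }

    // Variant B ("sequentialBatchedExecute"-style): each task returns its own list,
    // merged on the calling thread. No contention while reading, but the temporary
    // lists add allocation and GC cost.
    static Set<String> collectBatched(
            List<String> manifests, Function<String, List<String>> readDataFiles)
            throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(16);
        List<Future<List<String>>> futures = new ArrayList<>(manifests.size());
        for (String manifest : manifests) {
            futures.add(executor.submit(() -> readDataFiles.apply(manifest)));
        }
        Set<String> usedFiles = new HashSet<>();
        for (Future<List<String>> future : futures) {
            usedFiles.addAll(future.get());
        }
        executor.shutdown();
        return usedFiles;
    }
}
```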
@@ -472,13 +476,18 @@ public List<Snapshot> safelyGetAllSnapshots() throws IOException {
.map(id -> snapshotPath(id))
this::snapshotPath
fix
} catch (FileNotFoundException ignored) {
}
}
List<Snapshot> snapshots = Collections.synchronizedList(new ArrayList<>());
Collections.synchronizedList(new ArrayList<>(paths.size()));
fix
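A small generic illustration (not from the patch) of why the reviewer's capacity hint helps: with new ArrayList<>(paths.size()) the backing array is allocated once, so concurrent add() calls through the synchronized wrapper never trigger a resize while holding the lock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PresizedListExample {
    // Collect one result per input path; the capacity hint means the backing
    // array is sized up front, so concurrent add() calls never have to grow it.
    static List<String> readAll(List<String> paths) {
        List<String> results = Collections.synchronizedList(new ArrayList<>(paths.size()));
        paths.parallelStream().map(p -> "contents-of-" + p).forEach(results::add);
        return results;
    }
}
```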
@@ -489,18 +498,36 @@ public List<Changelog> safelyGetAllChangelogs() throws IOException {
.map(id -> longLivedChangelogPath(id))
this::longLivedChangelogPath
fix
} catch (FileNotFoundException ignored) {
}
}
List<Changelog> changelogs = Collections.synchronizedList(new ArrayList<>());
Collections.synchronizedList(new ArrayList<>(paths.size()));
fix
return changelogs;
}

private void collectSnapshots(Consumer<Path> pathConsumer, List<Path> paths)
throws IOException {
ExecutorService executor = |
Why not use ThreadPoolUtils#createCachedThreadPool?
fix
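A hedged sketch of what a collectSnapshots-style helper can look like. This is not the patch's implementation: it uses java.nio paths, a plain JDK fixed-size pool instead of ThreadPoolUtils#createCachedThreadPool, and a placeholder parser, but it shows the pattern of submitting one read task per path, skipping missing files, and gathering results into a pre-sized synchronized list.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class ParallelCollectSketch {

    // Feed every existing path to the consumer from a fixed-size pool;
    // missing files are simply skipped, mirroring the ignored FileNotFoundException.
    static void collect(Consumer<Path> pathConsumer, List<Path> paths, int parallelism)
            throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<?>> futures = new ArrayList<>(paths.size());
            for (Path path : paths) {
                futures.add(
                        executor.submit(
                                () -> {
                                    if (Files.exists(path)) {
                                        pathConsumer.accept(path);
                                    }
                                }));
            }
            for (Future<?> future : futures) {
                future.get(); // propagate read failures to the caller
            }
        } finally {
            executor.shutdown();
        }
    }

    // Usage: parse each snapshot file in parallel into a pre-sized, synchronized list.
    static List<String> readAll(List<Path> paths, int parallelism) throws Exception {
        List<String> snapshots = Collections.synchronizedList(new ArrayList<>(paths.size()));
        collect(path -> snapshots.add(readQuietly(path)), paths, parallelism);
        return snapshots;
    }

    private static String readQuietly(Path path) {
        try {
            return new String(Files.readAllBytes(path));
        } catch (java.io.IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
    }
}
```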
.forEach(dataFiles::add);
});
Iterable<String> dataFiles =
sequentialBatchedExecute(
Why use sequentialBatchedExecute? Maybe use randomlyExecute?
Replaced sequentialBatchedExecute with randomlyOnlyExecute, details in #4320 (comment).
This reverts commit 43350a6.
+1
Purpose
Support parallel reading of snapshot, manifest, and data file metadata to improve execution efficiency.
Tests
Tested on a table with 7 TB of data in local mode:
Before this patch: local orphan clean took about 20 min.
After this patch, the same SQL with parallelism 16 took only about 5 min.
API and Format
Documentation