
[improve][broker] Avoid producer latency rise during internal ledger trimming operations #20649

Closed
wants to merge 3 commits

Conversation

Contributor

@315157973 315157973 commented Jun 26, 2023

Motivation

The executor in ManagedLedgerImpl uses BookKeeper's MainWorkerPool, and both the produce and consume paths use this thread pool.

When there are a large number of partitions in the cluster and the cluster is busy, trimming ledgers causes the produce latency to rise from hundreds of milliseconds to seconds. We found that a large number of tasks were stuck in the thread pool's queue.

Trimming ledgers is a thread-safe operation and has no latency requirement, so we'd better separate it from BookKeeper's MainWorkerPool.

Modifications

Move ledger trimming off BookKeeper's MainWorkerPool onto a separate executor (the scheduledExecutor).
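
As a rough illustration of the idea (class and field names are simplified and are not the real ManagedLedgerImpl members), the trim task is submitted to its own scheduled executor so it can no longer queue behind, or in front of, publish tasks on the topic's MainWorkerPool thread:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Minimal, self-contained sketch (illustrative names only): trimming runs on a
// dedicated scheduled executor instead of the per-topic MainWorkerPool thread
// that also serves produce/consume tasks.
public class TrimOffloadSketch {

    private final ScheduledExecutorService scheduledExecutor =
            Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "ml-trim"));

    public void trimConsumedLedgersInBackground(CompletableFuture<?> promise) {
        scheduledExecutor.execute(() -> internalTrimConsumedLedgers(promise));
    }

    private void internalTrimConsumedLedgers(CompletableFuture<?> promise) {
        // Walk the ledger list, delete fully consumed ledgers, then complete the promise.
        promise.complete(null);
    }
}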

Verifying this change

315157973#9

Documentation

  • doc-not-needed

Matching PR in forked repository

315157973#9

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 26, 2023
@315157973 315157973 changed the title [fix][broker] Avoid producer latency rise [improve][broker] Avoid producer latency rise Jun 26, 2023
@eolivelli eolivelli changed the title [improve][broker] Avoid producer latency rise [improve][broker] Avoid producer latency rise during internal ledger trimming operations Jun 26, 2023
@eolivelli eolivelli requested review from lhotari, merlimat and codelipenghui and removed request for lhotari June 26, 2023 13:37
Contributor

@eolivelli eolivelli left a comment


We could ensure that all the calls to internalTrimLedgers are in the same executor; the only other call site is in asyncTruncate.

We could change it from

FutureUtil.waitForAll(futures).thenAccept(p -> {
    internalTrimLedgers(true, future);
}).exceptionally(e -> {
    future.completeExceptionally(e);
    return null;
});

to

FutureUtil.waitForAll(futures).thenAcceptAsync(p -> {
    internalTrimLedgers(true, future);
}, scheduledExecutor).exceptionally(e -> {
    future.completeExceptionally(e);
    return null;
});

This way we are sure that it is not an API service thread that executes this operation.

@315157973 315157973 requested a review from BewareMyPower June 29, 2023 11:24
@315157973 315157973 self-assigned this Jun 29, 2023
@315157973 315157973 requested a review from Technoboy- June 29, 2023 12:24
@codelipenghui
Contributor

@315157973 Do you have a flame graph or profiler result for the issue? If I understand correctly, the problem is that the trim operation exhausts the BookKeeper main thread pool or the CPU resources of the pod/node? If it exhausts the CPU resources of the pod/node, it looks like the only way to resolve the issue is to make the trim operation work more efficiently.

@315157973
Contributor Author

> @315157973 Do you have a flame graph or profiler result for the issue? If I understand correctly, the problem is that the trim operation exhausts the BookKeeper main thread pool or the CPU resources of the pod/node? If it exhausts the CPU resources of the pod/node, it looks like the only way to resolve the issue is to make the trim operation work more efficiently.

The CPU is not full. Each trim operation is time-consuming because there are locks and other operations to wait on. Assuming a single task takes 0.1 ms, 10,000 partitions take 1 s, which slows down the tasks queued behind them.

@@ -2451,15 +2451,15 @@ private void trimConsumedLedgersInBackground() {

     @Override
     public void trimConsumedLedgersInBackground(CompletableFuture<?> promise) {
-        executor.execute(() -> internalTrimConsumedLedgers(promise));
+        scheduledExecutor.execute(() -> internalTrimConsumedLedgers(promise));
     }

     public void trimConsumedLedgersInBackground(boolean isTruncate, CompletableFuture<?> promise) {
Contributor


Both the trim ledgers task and internalAsyncAddEntry try to acquire the lock synchronized(ml). As long as the trim ledgers task runs in the thread executor, that lock contention is avoided.

But if we make the trim ledgers task run in another thread, the publish task will have to wait for the lock synchronized(ml). Maybe we need to improve this scenario first.
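
To make the concern above concrete, here is a tiny self-contained sketch (a hypothetical class, not Pulsar code) of the pattern being discussed: both the publish path and the trim path synchronize on the same ManagedLedger instance, so moving trim to another executor does not remove the lock, it only changes which thread waits on it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for one managed ledger: publish (internalAsyncAddEntry-style)
// and trim both enter a synchronized(ml) block on the same object.
public class MlLockContentionSketch {

    private final Object ml = new Object(); // the monitor both paths contend on

    void publish() {
        synchronized (ml) {
            // Append the entry to the current ledger (usually fast).
        }
    }

    void trim() {
        synchronized (ml) {
            // Drop consumed ledgers from the in-memory ledger list
            // (can hold the monitor longer than a single publish).
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MlLockContentionSketch sketch = new MlLockContentionSketch();
        ExecutorService topicExecutor = Executors.newSingleThreadExecutor();                  // per-topic thread
        ScheduledExecutorService trimExecutor = Executors.newSingleThreadScheduledExecutor(); // dedicated trim pool

        topicExecutor.execute(sketch::publish);
        trimExecutor.execute(sketch::trim); // a concurrent publish still waits while trim holds the monitor

        topicExecutor.shutdown();
        trimExecutor.shutdown();
        topicExecutor.awaitTermination(1, TimeUnit.SECONDS);
        trimExecutor.awaitTermination(1, TimeUnit.SECONDS);
    }
}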

Contributor


Maybe we can resolve this busy-thread problem first, because the synchronized block only affects one ManagedLedger object, but one BK main executor thread may handle a lot of trim tasks for different ManagedLedger objects.

Contributor Author


Why not do it in two steps?
Thread pool isolation is necessary; trim should not run in the main thread pool.
Next, we can optimize how long the trim itself takes.

Contributor

@poorbarcode poorbarcode Jul 3, 2023


@315157973

Now the publish and trim ledgers tasks all run in the thread executor (it is a single-thread pool, and tasks of the same topic are executed in the same thread [L-1]), right? [Q-1]


If yes [Q-1], the trim ledgers task blocks the publish task, which this PR tries to fix, right? [Q-2]


If yes [Q-2], both the trim ledgers [L-2] and internalAsyncAddEntry [L-3] tasks try to acquire the same lock synchronized(ml), so the trim ledgers task still blocks the publish task even if this patch is merged, right? [Q-3]


[L-1]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L348
[L-2]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L2608
[L-3]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L798
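
For readers following [L-1], here is a hedged sketch of the general OrderedExecutor contract that line relies on (the builder settings and key below are illustrative, not Pulsar's actual wiring): every task submitted with the same ordering key, here the topic/ledger name, runs on the same thread and in submission order.

import org.apache.bookkeeper.common.util.OrderedExecutor;

// Sketch: an OrderedExecutor hashes the ordering key to one of its threads,
// so all tasks sharing a key are serialized on that single thread.
public class OrderedExecutorSketch {
    public static void main(String[] args) {
        OrderedExecutor mainWorkerPool = OrderedExecutor.newBuilder()
                .name("bk-main-sketch")
                .numThreads(4)
                .build();

        String ledgerName = "public/default/persistent/topic-a"; // illustrative key

        // Both tasks land on the same thread because they share the ordering key.
        mainWorkerPool.chooseThread(ledgerName).execute(() -> { /* publish task */ });
        mainWorkerPool.chooseThread(ledgerName).execute(() -> { /* trim task (before this patch) */ });

        mainWorkerPool.shutdown();
    }
}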

Contributor Author


This lock is held in short segments and does not block each producer for long.
But if trim and publish share the same thread pool, the pool is occupied for a long time.

Contributor

@poorbarcode poorbarcode Jul 3, 2023


@315157973

> This lock is held in short segments and does not block each producer for long.

Makes sense. The trim ledgers task has two sub-tasks: Memory changes and Persist changes. Memory changes takes the lock; Persist changes is an async task that does not take the lock.


But the sub-task Persist changes does not use the thread executor, right? [Q-1]


If yes [Q-1], this patch doesn't have much effect on this issue, right? [Q-2]

Contributor Author

@315157973 315157973 Jul 3, 2023


> If yes [Q-1], this patch doesn't have much effect on this issue, right? [Q-2]

Assuming each trim task adds 0.01 ms, with tens of thousands of partitions the total adds up to a lot.
Especially when many small messages are sent to the cluster, the situation gets worse.

Contributor


@315157973

> Assuming each trim task adds 0.01 ms, with tens of thousands of partitions the total adds up to a lot.
> Especially when many small messages are sent to the cluster, the situation gets worse.

I agree with you. But I'm wondering if the current changes will solve the issue.
[screenshot: 2023-07-03 22:29:13]

In the picture above, the logic of trim ledgers can be split into two sections: Verification (not in the lock block) and Execution (in the lock block), and there are some async tasks in the Execution section which run in other threads. The three tasks execute as in the flow below.

| Without lock | In lock | In other threads |
| --- | --- | --- |
| Verification | Memory changes | Persist changes |

Since there is the lock synchronized(ml), we can only reduce the executor thread cost of the Verification task, which is very simple: it only tries to acquire the lock and is done if it fails.

(Highlight) However, this change also makes contention on the lock synchronized(ml) more frequent, which can worsen performance.

For example, say topic-a and topic-b are assigned the same thread, executor-1, and the trim ledgers task is executed in the scheduled executor thread. These tasks would be executed as in the flow below.

| time | executor-1 | scheduled executor |
| --- | --- | --- |
| 1 | publish of topic-a | |
| 2 | publish of topic-b | |
| 3 | publish of topic-b | start trim ledgers of topic-a |
| 4 | publish of topic-b | get the ML lock of topic-a |
| 5 | publish of topic-b | |
| 6 | publish of topic-a | |
| 7 | waiting for the lock: all publishes of topic-a and topic-b are blocked until the trim ledgers task of topic-a completes | |

So at step 4, if the trim ledgers task of topic-a is executed in executor-1, the performance will be better; if it is executed in the scheduled executor, the performance will be worse. If more and more topics are assigned to the same thread, this change makes the performance better (maybe we can add a config to disable this feature; it would help users who have few topics).

Contributor Author


Users with fewer partitions submit fewer tasks to the thread pool

Contributor


This change can improve the performance if there are a lot of topics in the broker and few threads in the BK client thread pool.

Makes sense to me.

@hangc0276
Contributor

> @315157973 Do you have a flame graph or profiler result for the issue? If I understand correctly, the problem is that the trim operation exhausts the BookKeeper main thread pool or the CPU resources of the pod/node? If it exhausts the CPU resources of the pod/node, it looks like the only way to resolve the issue is to make the trim operation work more efficiently.

> The CPU is not full. Each trim operation is time-consuming because there are locks and other operations to wait on. Assuming a single task takes 0.1 ms, 10,000 partitions take 1 s, which slows down the tasks queued behind them.

@315157973 Could you share the lock profile flame graph before and after this change? +1 for yubiao's concern about the lock

@315157973
Contributor Author

315157973 commented Jul 3, 2023

> @315157973 Could you share the lock profile flame graph before and after this change? +1 for yubiao's concern about the lock

My computer cannot upload pictures due to security rules, but this problem is easy to reproduce.
Disable batching, use the producer to send a large number of small messages, make sure the broker CPU is above 60%, and give each broker about 30,000 partitions.
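
A hedged sketch of that client-side load (the service URL, topic name, payload size, and message count are placeholders, not values from this thread); batching is disabled so every small message becomes its own add-entry request:

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

// Sketch of the reproduction load: batching disabled, many small messages.
public class SmallMessageLoadSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")            // placeholder broker URL
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/topic-0")     // placeholder; repeat across many partitions
                .enableBatching(false)                            // "disable batching"
                .create();

        byte[] payload = new byte[16];                            // small message
        for (int i = 0; i < 1_000_000; i++) {
            producer.sendAsync(payload);                          // keep the broker busy
        }

        producer.flush();
        producer.close();
        client.close();
    }
}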


@codelipenghui
Contributor

I'm trying to reproduce the issue...

@codelipenghui
Contributor

I have tried to reproduce the issue over the last two days. It looks like the killer of the publish latency while the broker is trimming many ledgers is the ZooKeeper operation latency, because all the pending add-entry requests need to wait for the new ledger creation. I don't see the trim operation exhausting CPU resources, nor the contention introduced by the trim operation.

Test Steps

The test ran on 3 brokers and 3 bookies, with 10 CPU cores assigned to each broker and 6 cores assigned to each bookie.
I changed the max ledger rollover time to 2 min to make sure each topic would have enough ledgers.

  1. Start a test to write and read messages on 30k topics (non-partitioned topics)
  2. Set the data retention to 2 hours
  3. Wait for 2 hours to make sure many ledgers will become eligible for trimming
  4. Set the data retention to 2 minutes after the 2 hours (see the sketch after this list)
  5. Check the dashboard; the publish latency has been impacted.
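
Steps 2 and 4 could be scripted with the Java admin client roughly as below (the admin URL and namespace are placeholders; RetentionPolicies takes minutes and MB, with -1 leaving the size limit unbounded):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

// Sketch of steps 2 and 4: widen retention, run the workload, then shrink it
// so many ledgers become trimmable at the same time.
public class RetentionFlipSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")          // placeholder admin URL
                .build();

        String namespace = "public/default";                      // placeholder namespace

        // Step 2: retain data for 2 hours (size limit unbounded).
        admin.namespaces().setRetention(namespace, new RetentionPolicies(120, -1));

        // ... run the 30k-topic write/read workload for ~2 hours ...

        // Step 4: drop retention to 2 minutes so many ledgers become trimmable.
        admin.namespaces().setRetention(namespace, new RetentionPolicies(2, -1));

        admin.close();
    }
}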

The P99 publish latency goes up to 15s during the ledger trimming.


The ZooKeeper update latency also goes up to 15s during the ledger trimming.


For each bookie, almost 1 million ledgers have been deleted.


No obvious broker CPU spike


The flame graphs below were captured during the ledger trimming.

trim_cpu_0707.html.txt
trim_lock_0707.html.txt

Conclusion

I doubt this PR can resolve the publish latency spike during ledger trimming. From the above test, I don't see the bottleneck being contention or CPU exhaustion. Maybe if you have a very high-performance ZooKeeper, ZooKeeper will not be the bottleneck, and then you can look for other bottlenecks. But ZooKeeper reached 8.5k ops per second, and I don't think ZooKeeper is designed for such high throughput requirements.


Another solution is to introduce a throttling mechanism for the trim operation, so that we can prevent a huge impact on the cluster when trimming ledgers. It's not a big problem for users if the ledgers are trimmed a bit later, and it would make the trim operation smoother.
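
A minimal sketch of that throttling idea (purely hypothetical; the class, permit count, and retry delay are not an existing Pulsar or BookKeeper API): bound how many trims may run concurrently with a semaphore and retry the rest later instead of letting them pile up.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical trim throttle: at most `permits` trims run at once; the rest
// are re-scheduled after a delay so a retention change on many topics cannot
// flood the metadata store all at once.
public class TrimThrottleSketch {

    private final Semaphore permits = new Semaphore(2);                            // max concurrent trims (illustrative)
    private final ScheduledExecutorService trimExecutor = Executors.newScheduledThreadPool(4);

    public void trimInBackground(Runnable trimTask, CompletableFuture<Void> promise) {
        if (permits.tryAcquire()) {
            trimExecutor.execute(() -> {
                try {
                    trimTask.run();                                                // the actual ledger trimming work
                    promise.complete(null);
                } catch (Throwable t) {
                    promise.completeExceptionally(t);
                } finally {
                    permits.release();
                }
            });
        } else {
            // No permit available: retry later rather than blocking a worker thread.
            trimExecutor.schedule(() -> trimInBackground(trimTask, promise), 1, TimeUnit.SECONDS);
        }
    }
}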

@github-actions

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Aug 31, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@coderzc coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@315157973 315157973 closed this Aug 14, 2024
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test Stale
9 participants