
[improve][broker] Avoid producer latency rise during internal ledger trimming operations #20649

Closed
wants to merge 3 commits

Conversation

Contributor

@315157973 315157973 commented Jun 26, 2023

Motivation

The executor in ManagedLedgerImpl uses BookKeeper's MainWorkerPool, and both the produce and consume paths use this thread pool.

When there are a large number of partitions in the cluster and the cluster is busy, trimming ledgers causes the produce latency to rise from hundreds of milliseconds to seconds. We found that a large number of tasks were stuck in the thread pool's queue.

Trimming ledgers is a thread-safe operation and has no latency requirement, so we'd better separate it from BookKeeper's MainWorkerPool.

Modifications

Move ledger trimming off BookKeeper's MainWorkerPool onto a separate executor (the scheduledExecutor).
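
As a rough illustration of the idea (class and field names are simplified and are not the real ManagedLedgerImpl members), the trim task is submitted to its own scheduled executor so it can no longer queue behind, or in front of, publish tasks on the topic's MainWorkerPool thread:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Minimal, self-contained sketch (illustrative names only): trimming runs on a
// dedicated scheduled executor instead of the per-topic MainWorkerPool thread
// that also serves produce/consume tasks.
public class TrimOffloadSketch {

    private final ScheduledExecutorService scheduledExecutor =
            Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "ml-trim"));

    public void trimConsumedLedgersInBackground(CompletableFuture<?> promise) {
        scheduledExecutor.execute(() -> internalTrimConsumedLedgers(promise));
    }

    private void internalTrimConsumedLedgers(CompletableFuture<?> promise) {
        // Walk the ledger list, delete fully consumed ledgers, then complete the promise.
        promise.complete(null);
    }
}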

Verifying this change

315157973#9

Documentation

  • doc-not-needed

Matching PR in forked repository

315157973#9

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jun 26, 2023
@315157973 315157973 changed the title [fix][broker] Avoid producer latency rise [improve][broker] Avoid producer latency rise Jun 26, 2023
@eolivelli eolivelli changed the title [improve][broker] Avoid producer latency rise [improve][broker] Avoid producer latency rise during internal ledger trimming operations Jun 26, 2023
@eolivelli eolivelli requested review from lhotari, merlimat and codelipenghui and removed request for lhotari June 26, 2023 13:37
Contributor

@eolivelli eolivelli left a comment


We could ensure that all the calls to internalTrimLedgers are in the same executor; the only other call site is in asyncTruncate.

We could change it from

FutureUtil.waitForAll(futures).thenAccept(p -> {
    internalTrimLedgers(true, future);
}).exceptionally(e -> {
    future.completeExceptionally(e);
    return null;
});

to

FutureUtil.waitForAll(futures).thenAcceptAsync(p -> {
    internalTrimLedgers(true, future);
}, scheduledExecutor).exceptionally(e -> {
    future.completeExceptionally(e);
    return null;
});

This way we are sure that it is not an API service thread that executes this operation.

@315157973 315157973 requested a review from BewareMyPower June 29, 2023 11:24
@315157973 315157973 self-assigned this Jun 29, 2023
@315157973 315157973 requested a review from Technoboy- June 29, 2023 12:24
@codelipenghui
Contributor

@315157973 Do you have a flame graph or profiler result for the issue? If I understand correctly, the problem is that the trim operation exhausts the BookKeeper main thread pool or the CPU resources of the pod/node? If it exhausts the CPU resources of the pod/node, it looks like the only way to resolve the issue is to make the trim operation work more efficiently.

@315157973
Contributor Author

> @315157973 Do you have a flame graph or profiler result for the issue? If I understand correctly, the problem is that the trim operation exhausts the BookKeeper main thread pool or the CPU resources of the pod/node? If it exhausts the CPU resources of the pod/node, it looks like the only way to resolve the issue is to make the trim operation work more efficiently.

The CPU is not full. Each trim operation is time-consuming because there are locks and other operations to wait on. Assuming a single task takes 0.1 ms, 10,000 partitions take 1 s, which slows down the tasks queued behind them.

@@ -2451,15 +2451,15 @@ private void trimConsumedLedgersInBackground() {

     @Override
     public void trimConsumedLedgersInBackground(CompletableFuture<?> promise) {
-        executor.execute(() -> internalTrimConsumedLedgers(promise));
+        scheduledExecutor.execute(() -> internalTrimConsumedLedgers(promise));
     }

     public void trimConsumedLedgersInBackground(boolean isTruncate, CompletableFuture<?> promise) {
Contributor


Both the trim ledgers task and internalAsyncAddEntry try to acquire the lock synchronized(ml). As long as the trim ledgers task runs in the thread executor, that lock contention is avoided.

But if we make the trim ledgers task run in another thread, the publish task will have to wait for the lock synchronized(ml). Maybe we need to improve this scenario first.
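
To make the concern above concrete, here is a tiny self-contained sketch (a hypothetical class, not Pulsar code) of the pattern being discussed: both the publish path and the trim path synchronize on the same ManagedLedger instance, so moving trim to another executor does not remove the lock, it only changes which thread waits on it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for one managed ledger: publish (internalAsyncAddEntry-style)
// and trim both enter a synchronized(ml) block on the same object.
public class MlLockContentionSketch {

    private final Object ml = new Object(); // the monitor both paths contend on

    void publish() {
        synchronized (ml) {
            // Append the entry to the current ledger (usually fast).
        }
    }

    void trim() {
        synchronized (ml) {
            // Drop consumed ledgers from the in-memory ledger list
            // (can hold the monitor longer than a single publish).
        }
    }

    public static void main(String[] args) throws InterruptedException {
        MlLockContentionSketch sketch = new MlLockContentionSketch();
        ExecutorService topicExecutor = Executors.newSingleThreadExecutor();                  // per-topic thread
        ScheduledExecutorService trimExecutor = Executors.newSingleThreadScheduledExecutor(); // dedicated trim pool

        topicExecutor.execute(sketch::publish);
        trimExecutor.execute(sketch::trim); // a concurrent publish still waits while trim holds the monitor

        topicExecutor.shutdown();
        trimExecutor.shutdown();
        topicExecutor.awaitTermination(1, TimeUnit.SECONDS);
        trimExecutor.awaitTermination(1, TimeUnit.SECONDS);
    }
}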

Contributor


Maybe we can resolve this busy-thread problem first, because the synchronized block only affects one ManagedLedger object, but one BK main executor thread may handle a lot of trim tasks for different ManagedLedger objects.

Contributor Author


Why not do it in two steps?
Thread pool isolation is necessary; trim should not run in the main thread pool.
Next, we can optimize how long the trim itself takes.

Contributor

@poorbarcode poorbarcode Jul 3, 2023


@315157973

Now the publish and trim ledgers tasks all run in the thread executor (it is a single-thread pool, and tasks of the same topic are executed in the same thread [L-1]), right? [Q-1]


If yes [Q-1], the trim ledgers task blocks the publish task, which this PR tries to fix, right? [Q-2]


If yes [Q-2], both the trim ledgers [L-2] and internalAsyncAddEntry [L-3] tasks try to acquire the same lock synchronized(ml), so the trim ledgers task still blocks the publish task even if this patch is merged, right? [Q-3]


[L-1]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L348
[L-2]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L2608
[L-3]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L798
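
For readers following [L-1], here is a hedged sketch of the general OrderedExecutor contract that line relies on (the builder settings and key below are illustrative, not Pulsar's actual wiring): every task submitted with the same ordering key, here the topic/ledger name, runs on the same thread and in submission order.

import org.apache.bookkeeper.common.util.OrderedExecutor;

// Sketch: an OrderedExecutor hashes the ordering key to one of its threads,
// so all tasks sharing a key are serialized on that single thread.
public class OrderedExecutorSketch {
    public static void main(String[] args) {
        OrderedExecutor mainWorkerPool = OrderedExecutor.newBuilder()
                .name("bk-main-sketch")
                .numThreads(4)
                .build();

        String ledgerName = "public/default/persistent/topic-a"; // illustrative key

        // Both tasks land on the same thread because they share the ordering key.
        mainWorkerPool.chooseThread(ledgerName).execute(() -> { /* publish task */ });
        mainWorkerPool.chooseThread(ledgerName).execute(() -> { /* trim task (before this patch) */ });

        mainWorkerPool.shutdown();
    }
}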

Contributor Author


This lock is held in short segments and does not block each producer for long.
But if trim and publish share the same thread pool, the pool is occupied for a long time.

Contributor

@poorbarcode poorbarcode Jul 3, 2023


@315157973

> This lock is held in short segments and does not block each producer for long.

Makes sense. The trim ledgers task has two sub-tasks: Memory changes and Persist changes. Memory changes takes the lock; Persist changes is an async task that does not take the lock.


But the sub-task Persist changes does not use the thread executor, right? [Q-1]


If yes [Q-1], this patch doesn't have much effect on this issue, right? [Q-2]

Contributor Author

@315157973 315157973 Jul 3, 2023


> If yes [Q-1], this patch doesn't have much effect on this issue, right? [Q-2]

Assuming each trim task adds 0.01 ms, with tens of thousands of partitions the total adds up to a lot.
Especially when many small messages are sent to the cluster, the situation gets worse.

Contributor


@315157973

> Assuming each trim task adds 0.01 ms, with tens of thousands of partitions the total adds up to a lot.
> Especially when many small messages are sent to the cluster, the situation gets worse.

I agree with you. But I'm wondering if the current changes will solve the issue.
[screenshot: 2023-07-03 22:29:13]

In the picture above, the logic of trim ledgers can be split into two sections: Verification (not in the lock block) and Execution (in the lock block), and there are some async tasks in the Execution section which run in other threads. The three tasks execute as in the flow below.

| Without lock | In lock | In other threads |
| --- | --- | --- |
| Verification | Memory changes | Persist changes |

Since there is the lock synchronized(ml), we can only reduce the executor thread cost of the Verification task, which is very simple: it only tries to acquire the lock and is done if it fails.

(Highlight) However, this change also makes contention on the lock synchronized(ml) more frequent, which can worsen performance.

For example, say topic-a and topic-b are assigned the same thread, executor-1, and the trim ledgers task is executed in the scheduled executor thread. These tasks would be executed as in the flow below.

| time | executor-1 | scheduled executor |
| --- | --- | --- |
| 1 | publish of topic-a | |
| 2 | publish of topic-b | |
| 3 | publish of topic-b | start trim ledgers of topic-a |
| 4 | publish of topic-b | get the ML lock of topic-a |
| 5 | publish of topic-b | |
| 6 | publish of topic-a | |
| 7 | waiting for the lock: all publishes of topic-a and topic-b are blocked until the trim ledgers task of topic-a completes | |

So at step 4, if the trim ledgers task of topic-a is executed in executor-1, the performance will be better; if it is executed in the scheduled executor, the performance will be worse. If more and more topics are assigned to the same thread, this change makes the performance better (maybe we can add a config to disable this feature; it would help users who have few topics).

Contributor Author


Users with fewer partitions submit fewer tasks to the thread pool

Contributor


This change can improve the performance if there are a lot of topics in the broker and few threads in the BK client thread pool.

Makes sense to me.

@hangc0276
Contributor

> @315157973 Do you have a flame graph or profiler result for the issue? If I understand correctly, the problem is that the trim operation exhausts the BookKeeper main thread pool or the CPU resources of the pod/node? If it exhausts the CPU resources of the pod/node, it looks like the only way to resolve the issue is to make the trim operation work more efficiently.

> The CPU is not full. Each trim operation is time-consuming because there are locks and other operations to wait on. Assuming a single task takes 0.1 ms, 10,000 partitions take 1 s, which slows down the tasks queued behind them.

@315157973 Could you share the lock profile flame graph before and after this change? +1 for yubiao's concern about the lock

@315157973
Contributor Author

315157973 commented Jul 3, 2023

> @315157973 Could you share the lock profile flame graph before and after this change? +1 for yubiao's concern about the lock

My computer cannot upload pictures due to security rules, but this problem is easy to reproduce.
Disable batching, use the producer to send a large number of small messages, make sure the broker CPU is above 60%, and give each broker about 30,000 partitions.
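
A hedged sketch of that client-side load (the service URL, topic name, payload size, and message count are placeholders, not values from this thread); batching is disabled so every small message becomes its own add-entry request:

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

// Sketch of the reproduction load: batching disabled, many small messages.
public class SmallMessageLoadSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")            // placeholder broker URL
                .build();

        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/default/topic-0")     // placeholder; repeat across many partitions
                .enableBatching(false)                            // "disable batching"
                .create();

        byte[] payload = new byte[16];                            // small message
        for (int i = 0; i < 1_000_000; i++) {
            producer.sendAsync(payload);                          // keep the broker busy
        }

        producer.flush();
        producer.close();
        client.close();
    }
}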


@codelipenghui
Contributor

I'm trying to reproduce the issue...

@codelipenghui
Contributor

I have tried to reproduce the issue over the last two days. It looks like the killer of the publish latency while the broker is trimming many ledgers is the ZooKeeper operation latency, because all the pending add-entry requests need to wait for the new ledger creation. I don't see the trim operation exhausting CPU resources, nor the contention introduced by the trim operation.

Test Steps

The test ran on 3 brokers and 3 bookies, with 10 CPU cores assigned to each broker and 6 cores assigned to each bookie.
I changed the max ledger rollover time to 2 min to make sure each topic would have enough ledgers.

  1. Start a test to write and read messages on 30k topics (non-partitioned topics)
  2. Set the data retention to 2 hours
  3. Wait for 2 hours to make sure many ledgers will become eligible for trimming
  4. Set the data retention to 2 minutes after the 2 hours (see the sketch after this list)
  5. Check the dashboard; the publish latency has been impacted.
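
Steps 2 and 4 could be scripted with the Java admin client roughly as below (the admin URL and namespace are placeholders; RetentionPolicies takes minutes and MB, with -1 leaving the size limit unbounded):

import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

// Sketch of steps 2 and 4: widen retention, run the workload, then shrink it
// so many ledgers become trimmable at the same time.
public class RetentionFlipSketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")          // placeholder admin URL
                .build();

        String namespace = "public/default";                      // placeholder namespace

        // Step 2: retain data for 2 hours (size limit unbounded).
        admin.namespaces().setRetention(namespace, new RetentionPolicies(120, -1));

        // ... run the 30k-topic write/read workload for ~2 hours ...

        // Step 4: drop retention to 2 minutes so many ledgers become trimmable.
        admin.namespaces().setRetention(namespace, new RetentionPolicies(2, -1));

        admin.close();
    }
}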

The P99 publish latency goes up to 15s during the ledger trimming.


The ZooKeeper update latency also goes up to 15s during the ledger trimming.


For each bookie, almost 1 million ledgers have been deleted.


No obvious broker CPU spike


The flame graphs below were captured during the ledger trimming.

trim_cpu_0707.html.txt
trim_lock_0707.html.txt

Conclusion

I doubt this PR can resolve the publish latency spike during ledger trimming. From the above test, I don't see the bottleneck being contention or CPU exhaustion. Maybe if you have a very high-performance ZooKeeper, ZooKeeper will not be the bottleneck, and then you can look for other bottlenecks. But ZooKeeper reached 8.5k ops per second, and I don't think ZooKeeper is designed for such high throughput requirements.


Another solution is to introduce a throttling mechanism for the trim operation, so that we can prevent a huge impact on the cluster when trimming ledgers. It's not a big problem for users if the ledgers are trimmed a bit later, and it would make the trim operation smoother.
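
A minimal sketch of that throttling idea (purely hypothetical; the class, permit count, and retry delay are not an existing Pulsar or BookKeeper API): bound how many trims may run concurrently with a semaphore and retry the rest later instead of letting them pile up.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical trim throttle: at most `permits` trims run at once; the rest
// are re-scheduled after a delay so a retention change on many topics cannot
// flood the metadata store all at once.
public class TrimThrottleSketch {

    private final Semaphore permits = new Semaphore(2);                            // max concurrent trims (illustrative)
    private final ScheduledExecutorService trimExecutor = Executors.newScheduledThreadPool(4);

    public void trimInBackground(Runnable trimTask, CompletableFuture<Void> promise) {
        if (permits.tryAcquire()) {
            trimExecutor.execute(() -> {
                try {
                    trimTask.run();                                                // the actual ledger trimming work
                    promise.complete(null);
                } catch (Throwable t) {
                    promise.completeExceptionally(t);
                } finally {
                    permits.release();
                }
            });
        } else {
            // No permit available: retry later rather than blocking a worker thread.
            trimExecutor.schedule(() -> trimInBackground(trimTask, promise), 1, TimeUnit.SECONDS);
        }
    }
}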

@github-actions

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Aug 31, 2023
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@coderzc coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@315157973 315157973 closed this Aug 14, 2024
Labels
doc-not-needed Your PR changes do not impact docs ready-to-test Stale
9 participants