[improve][broker] Avoid producer latency rise during internal ledger trimming operations #20649
Conversation
We could ensure that all the calls to internalTrimLedgers are in the same executor; the only other call site is in asyncTruncate:
pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java (line 4239 in d7186a6)
internalTrimLedgers(true, future);
We could change it from
FutureUtil.waitForAll(futures).thenAccept(p -> {
internalTrimLedgers(true, future);
}).exceptionally(e -> {
future.completeExceptionally(e);
return null;
});
to
FutureUtil.waitForAll(futures).thenAcceptAsync(p -> {
internalTrimLedgers(true, future);
}, scheduledExecutor).exceptionally(e -> {
future.completeExceptionally(e);
return null;
});
This way we are sure that it is not an API Service thread that executes this operation.
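A minimal standalone sketch (plain Java, not Pulsar code; the thread names are made up) of why this matters: thenAccept may run the callback on whichever thread completes the future, while thenAcceptAsync pins it to the executor we pass in.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Standalone demo: thenAccept runs on the thread that completes the future,
// thenAcceptAsync(.., executor) runs on the executor we choose.
public class CallbackThreadSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService apiThread = Executors.newSingleThreadExecutor(r -> new Thread(r, "api-service-thread"));
        ExecutorService scheduledExecutor = Executors.newSingleThreadExecutor(r -> new Thread(r, "scheduled-executor"));

        CompletableFuture<Void> f1 = new CompletableFuture<>();
        f1.thenAccept(v -> System.out.println("thenAccept runs on: " + Thread.currentThread().getName()));

        CompletableFuture<Void> f2 = new CompletableFuture<>();
        f2.thenAcceptAsync(
                v -> System.out.println("thenAcceptAsync runs on: " + Thread.currentThread().getName()),
                scheduledExecutor);

        // Complete both futures from the "API service" thread.
        apiThread.submit(() -> {
            f1.complete(null);
            f2.complete(null);
        }).get();

        Thread.sleep(200); // let the async callback print before shutting down
        apiThread.shutdown();
        scheduledExecutor.shutdown();
    }
}
```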
@315157973 Do you have a flame graph or profiler result for the issue? If I understand correctly, the problem is that the trim operation exhausted the BookKeeper main thread pool or exhausted the CPU resources of the pod/node? If it exhausted the CPU resources of the pod/node, it looks like the only way to resolve the issue is to make the trim operation work more efficiently.
The CPU is not full. Each trim operation is time-consuming because there are locks and other operations that will wait. Assuming that a task takes 0.1 ms, 10,000 partitions will take 1 s, which will slow down the tasks queued later.
@@ -2451,15 +2451,15 @@ private void trimConsumedLedgersInBackground() {

     @Override
     public void trimConsumedLedgersInBackground(CompletableFuture<?> promise) {
-        executor.execute(() -> internalTrimConsumedLedgers(promise));
+        scheduledExecutor.execute(() -> internalTrimConsumedLedgers(promise));
     }

     public void trimConsumedLedgersInBackground(boolean isTruncate, CompletableFuture<?> promise) {
Both tasks trim ledgers and internalAsyncAddEntry will try to acquire the lock synchronized(ml). If the task trim ledgers runs in the thread executor, it avoids lock contention.
So if we make the task trim ledgers run in another thread, the publish task will also have to wait for the lock synchronized(ml). Maybe we need to improve this scenario first.
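To make the contention concern concrete, here is a minimal standalone sketch (plain Java, not Pulsar code; ml, trimLedgers, and publish are stand-ins): when trim runs on a separate thread and holds the monitor that the add-entry path also needs, the publish thread blocks for the whole in-lock section.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Two threads, one monitor: a long trim section inside synchronized(ml)
// makes a concurrent publish block on the same monitor.
public class MlLockContentionSketch {
    private final Object ml = new Object(); // stands in for the ManagedLedger monitor

    void trimLedgers() {
        synchronized (ml) {   // the "Memory changes" part of trim holds the lock
            sleepQuietly(50); // pretend the in-lock work takes a while
        }
    }

    void publish() {
        synchronized (ml) {   // the add-entry path also takes the lock
            sleepQuietly(1);  // publish itself is cheap
        }
    }

    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        MlLockContentionSketch sketch = new MlLockContentionSketch();
        ExecutorService trimThread = Executors.newSingleThreadExecutor();    // separate trim thread
        ExecutorService topicExecutor = Executors.newSingleThreadExecutor(); // per-topic executor

        trimThread.execute(sketch::trimLedgers);
        Thread.sleep(10); // give the trim task time to take the lock

        long start = System.nanoTime();
        topicExecutor.submit(sketch::publish).get(); // blocks until trim releases ml
        System.out.printf("publish waited ~%d ms on the ml monitor%n",
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));

        trimThread.shutdown();
        topicExecutor.shutdown();
    }
}
```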
Maybe we can resolve this busy-thread problem first, because the synchronized block only affects one ManagedLedger object, but one BK main executor thread may handle many trim tasks for different ManagedLedger objects.
Why not do it in two steps?
1. Thread pool isolation is necessary; trim should not run in the main thread pool.
2. Next, let's optimize the time cost of trim.
Now the tasks publish and trim ledgers all run in the thread executor (it is a single-thread pool, and the tasks of the same topic will be executed in the same thread [L-1]), right? [Q-1]
If yes [Q-1], the task trim ledgers blocks the task publish, which this PR tries to fix, right? [Q-2]
If yes [Q-2], both tasks trim ledgers [L-2] and internalAsyncAddEntry [L-3] will try to acquire the same lock synchronized(ml), so the task trim ledgers still blocks the task publish even if this patch is merged, right? [Q-3]
[L-1]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L348
[L-2]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L2608
[L-3]: https://github.com/apache/pulsar/blob/master/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L798
This lock is segmented and does not block each producer for a long time.
But if they are in the same thread pool, the thread pool will be occupied for a long time
> This lock is segmented and does not block each producer for a long time.

Makes sense. The task trim ledgers has two sub-tasks, Memory changes and Persist changes: Memory changes take the lock, while Persist changes is an async task that does not take the lock.
But the sub-task Persist changes will not use the thread executor, right? [Q-1]
If yes [Q-1], this patch doesn't have much effect on this issue, right? [Q-2]
> If yes [Q-1], this patch doesn't have much effect on this issue, right? [Q-2]

Assuming that one trim task adds 0.01 ms, if there are tens of thousands of partitions, it will add up to a lot.
Especially when many small packets are sent to the cluster, this situation will get worse.
> Assuming that one trim task adds 0.01 ms, if there are tens of thousands of partitions, it will add up to a lot.
> Especially when many small packets are sent to the cluster, this situation will get worse.

I agree with you. But I'm wondering if the current changes will solve the issue.
In the picture above, the logic of trim ledgers can be split into two sections: Verification (not in the lock block) and Execution (in the lock block), and some async tasks in the Execution section will run in other threads. The three tasks execute as the flow below.

| Without lock | In lock | In other threads |
|---|---|---|
| Verification | Memory changes | Persist changes |

Since there is the lock synchronized(ml), we can only reduce the executor thread cost of the task Verification, which is very simple: it only tries to acquire the lock and is done if it fails (see the sketch below).
(Highlight) However, this change also makes the scramble for the lock synchronized(ml) more frequent, which can worsen performance.
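A minimal sketch of the cheap Verification step described above (hypothetical names, not the actual ManagedLedgerImpl code): the only work done on the executor thread is a try-acquire that gives up immediately if another trim is already in progress.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the cheap "Verification" step: the executor thread only
// pays for a try-acquire; if another trim is already running for this ledger, it is done.
class TrimVerificationSketch {
    private final AtomicBoolean trimInProgress = new AtomicBoolean(false);

    void trimConsumedLedgers(Runnable executionUnderLock) {
        // Verification: costs almost nothing on the executor thread.
        if (!trimInProgress.compareAndSet(false, true)) {
            return; // another trim already holds the mutex; nothing more to do
        }
        try {
            // Execution: memory changes under the real lock, persist changes async.
            executionUnderLock.run();
        } finally {
            trimInProgress.set(false);
        }
    }
}
```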
For example, topic-a and topic-b were assigned the same thread, executor-1, and the task trim ledgers will be executed in the thread scheduled executor. These tasks will be executed as the flow below:

| time | executor-1 | scheduled executor |
|---|---|---|
| 1 | publish of topic-a | |
| 2 | publish of topic-b | |
| 3 | publish of topic-b | start trim ledgers of topic-a |
| 4 | publish of topic-b | get the ML lock of topic-a |
| 5 | publish of topic-b | |
| 6 | publish of topic-a | |
| 7 | waiting for the lock; all publish of topic-a and topic-b will be blocked until the task trim ledgers of topic-a is complete | |
So in step 4, if the task trim ledgers of topic-a completes before a publish of topic-a arrives, the performance will be better; if a publish of topic-a arrives while trim ledgers of topic-a still holds the lock (steps 6-7), the performance will be worse. If more and more topics are assigned to the same thread, the performance will be better (maybe we can add a config to disable this feature; it is helpful for the users who have few topics).
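To make the "config to disable this feature" idea concrete, here is a hypothetical sketch; neither the flag nor the class exists in Pulsar, it only illustrates choosing the dispatch target according to the trade-off above.

```java
import java.util.concurrent.Executor;

// Hypothetical sketch of a config switch for where trim runs.
// None of these names exist in Pulsar; the class only illustrates the dispatch choice.
class TrimDispatchSketch {
    private final Executor topicExecutor;        // per-topic ordered executor (current behavior)
    private final Executor trimScheduler;        // separate scheduler, as proposed by this PR
    private final boolean trimInSeparateThread;  // hypothetical config flag

    TrimDispatchSketch(Executor topicExecutor, Executor trimScheduler, boolean trimInSeparateThread) {
        this.topicExecutor = topicExecutor;
        this.trimScheduler = trimScheduler;
        this.trimInSeparateThread = trimInSeparateThread;
    }

    void trimConsumedLedgersInBackground(Runnable internalTrim) {
        // Few topics per thread: keeping trim on the topic executor avoids cross-thread
        // contention on synchronized(ml). Many topics per thread: offloading keeps the
        // topic executor free for publish tasks.
        Executor target = trimInSeparateThread ? trimScheduler : topicExecutor;
        target.execute(internalTrim);
    }
}
```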
Users with fewer partitions submit fewer tasks to the thread pool
This change can improve the performance if there are a lot of topics in the broker and there are few threads in the BK client thread pool.
Makes sense to me.
@315157973 Could you share the lock profile flame graph before and after this change? +1 for yubiao's concern about the lock.
My computer cannot upload pictures due to security rules, and this problem is easy to reproduce.
> This change can improve the performance if there are a lot of topics in the broker and there are few threads in the BK client thread pool.
> Makes sense to me.

I'm trying to reproduce the issue...
I have tried to reproduce the issue in the last two days. It looks like the killer of the publish latency when the broker is trimming many ledgers is the ZooKeeper operation latency, because all the pending add-entry requests need to wait for the new ledger creation. I don't see the trim operation exhausting the CPU resources or the contention introduced by the trim operation.
Test Steps
The test is running on 3 brokers and 3 bookies, with 10 CPU cores assigned to the broker and 6 cores assigned to the bookie.
The P99 publish latency goes up to 15 s during the ledger trimming. The ZooKeeper update latency also goes up to 15 s during the ledger trimming. For each bookie, almost 1 million ledgers have been deleted. There is no obvious broker CPU spike. The flame graph was made during the ledger trimming: trim_cpu_0707.html.txt
Conclusion
I doubt this PR can resolve the publish latency spike issue during the ledger trimming. From the above test, I don't see that the bottleneck is contention or CPU exhaustion.
Maybe if you have a very high-performance ZooKeeper, ZooKeeper will not be the bottleneck. Then you can figure out other bottlenecks. But ZooKeeper reached 8.5k ops per second; I don't think ZooKeeper is designed for such high-performance requirements.
Another solution is to introduce a throttling mechanism for the trim operation, so that we can prevent a huge impact on the cluster when trimming ledgers. It's not too big a problem for users to trim the ledgers a bit later, but it can make the trim operation run smoothly.
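As a rough illustration of the throttling idea (hypothetical names and limits, not an existing Pulsar mechanism), a trim scheduler could queue trim tasks and drain only a bounded number per interval, so a burst of ledger trimming cannot flood the metadata store or the executors:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical throttled trim scheduler: trim tasks are queued and drained at a
// bounded rate instead of being fired immediately. The names and limits are illustrative.
class ThrottledTrimScheduler implements AutoCloseable {
    private final Queue<Runnable> pendingTrims = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService drainer = Executors.newSingleThreadScheduledExecutor();
    private final int maxTrimsPerInterval;

    ThrottledTrimScheduler(int maxTrimsPerInterval, long intervalMillis) {
        this.maxTrimsPerInterval = maxTrimsPerInterval;
        drainer.scheduleAtFixedRate(this::drain, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Enqueue one topic's trim task; it will run later, at a bounded rate. */
    void submitTrim(Runnable trimTask) {
        pendingTrims.add(trimTask);
    }

    private void drain() {
        // Run at most maxTrimsPerInterval queued trims; the rest wait for the next tick.
        for (int i = 0; i < maxTrimsPerInterval; i++) {
            Runnable task = pendingTrims.poll();
            if (task == null) {
                return;
            }
            try {
                task.run();
            } catch (RuntimeException e) {
                // A failed trim for one ledger should not stop draining the others.
            }
        }
    }

    @Override
    public void close() {
        drainer.shutdown();
    }
}
```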
The PR had no activity for 30 days; marking it with the Stale label.
Motivation
The executor in ManagedLedgerImpl uses the MainWorkerPool of BookKeeper, and both produce and consume use this thread pool. When there are a large number of partitions in the cluster and the cluster is busy, trimming ledgers causes the produce latency to rise from hundreds of milliseconds to seconds. We found that a large number of tasks are stuck in the queue of the thread pool.
Trimming ledgers is a thread-safe operation and has no latency requirements, so we'd better separate it from the MainWorkerPool of BookKeeper.
Modifications
Separate it from the MainWorkerPool of BookKeeper.
Verifying this change
315157973#9
Documentation
doc-not-needed
Matching PR in forked repository
315157973#9