Perf[MQB]: make independent item pools for channels #479
Summary
Previously, we used one shared concurrent item pool to populate items going to all `mqbnet::Channel`s. However, at high throughput the mechanism that keeps this pool thread-safe cannot keep up and instead slows down every thread accessing the pool; the higher the call frequency, the worse the performance degradation. At our throughput the negative effect is moderate, but the average frequency is not the whole story: this contention can slow down the broker during short spikes of messages.
Also, this PR moves item pool initialization from the top-level `mqbnet::TransportManager` directly into `mqbnet::Channel`.
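To illustrate the ownership change, here is a minimal sketch. The class and member names are illustrative stand-ins, not the actual BlazingMQ types:

```cpp
#include <cstddef>

// Illustrative stand-in for a concurrent item pool; not the real class.
struct ItemPool {
    std::size_t d_itemSize;
    explicit ItemPool(std::size_t itemSize) : d_itemSize(itemSize) {}
};

// Before: the top-level manager owned one pool and handed a pointer to every
// channel it created, so all channels shared one lock-protected pool.
struct ChannelShared {
    ItemPool* d_itemPool_p;  // held, not owned
    explicit ChannelShared(ItemPool* pool) : d_itemPool_p(pool) {}
};

// After: each channel constructs and owns its own pool, so pool accesses
// from different channels never contend with each other.
struct ChannelIndependent {
    ItemPool d_itemPool;  // owned, independent per channel
    ChannelIndependent() : d_itemPool(64) {}
};
```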
Profiler
Before:
After:
Isolated benchmark
A simple benchmark program shows how large this effect can be.
By binding all threads to the same pool or to separate ones (comment/uncomment in the snippet), we can measure the total time to perform the same number of concurrent allocations and deallocations with a given number of threads.
The most important factor here is the frequency of concurrent pool calls: the more frequent they are, the more likely threads are to collide on the concurrency mechanism. To emulate this, the code has an inner "work" loop between pool calls. If this inner loop runs for 100 microseconds or more, the difference between one shared pool and independent ones is not visible. However, at 1 microsecond and below, the difference becomes huge.
With a 0.5-microsecond "work" loop, the execution times are:
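The benchmark described above can be sketched roughly as follows. This is a self-contained approximation using a mutex-guarded free list in place of the real concurrent pool; all names are hypothetical, and the "work" duration is emulated with a spin count rather than a wall-clock target:

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Minimal concurrent item pool: a mutex-guarded free list.
class ItemPool {
    std::mutex                           d_mutex;
    std::vector<void*>                   d_free;
    std::vector<std::unique_ptr<char[]>> d_blocks;
    std::size_t                          d_itemSize;

  public:
    explicit ItemPool(std::size_t itemSize) : d_itemSize(itemSize) {}

    void* allocate() {
        std::lock_guard<std::mutex> guard(d_mutex);
        if (d_free.empty()) {
            d_blocks.emplace_back(new char[d_itemSize]);
            return d_blocks.back().get();
        }
        void* item = d_free.back();
        d_free.pop_back();
        return item;
    }

    void deallocate(void* item) {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_free.push_back(item);
    }
};

// Run `numThreads` workers, each doing `iters` allocate/work/deallocate
// cycles. `shared == true` binds all threads to one pool; `false` gives each
// thread its own pool, as this PR does per channel. Returns elapsed seconds.
double runBench(bool shared, int numThreads, int iters, int workSpins)
{
    std::vector<std::unique_ptr<ItemPool>> pools;
    const int numPools = shared ? 1 : numThreads;
    for (int i = 0; i < numPools; ++i) {
        pools.emplace_back(new ItemPool(64));
    }

    std::atomic<long> sink{0};  // prevents the "work" loop from being elided
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        ItemPool* pool = pools[shared ? 0 : t].get();
        threads.emplace_back([&sink, pool, iters, workSpins] {
            for (int i = 0; i < iters; ++i) {
                void* item = pool->allocate();
                long acc = 0;  // inner "work" loop between pool calls
                for (int w = 0; w < workSpins; ++w) {
                    acc += w;
                }
                sink.fetch_add(acc, std::memory_order_relaxed);
                pool->deallocate(item);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start).count();
}
```

Shrinking `workSpins` raises the pool-call frequency, which is exactly the regime where the shared pool's lock contention dominates.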
Counting allocator stats table
Before:
The new counting allocator stats table looks like this: