Perf[MQB]: make independent item pools for channels #479
Summary
Previously, we used one shared concurrent item pool to populate items going to all `mqbnet::Channel`s. However, at high throughput the mechanism that keeps this pool thread-safe cannot keep up and instead slows down every thread accessing the pool; the higher the call frequency, the worse the performance degradation. At our throughput the negative effect is moderate, but the average frequency is not the whole story: this contention can slow down the broker during short spikes of messages.
Also, this PR moves item pool initialization from the top-level `mqbnet::TransportManager` directly into `mqbnet::Channel`.
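To illustrate the ownership change, here is a minimal sketch. The class and member names are illustrative stand-ins, not the actual BlazingMQ types:

```cpp
#include <cstddef>

// Illustrative stand-in for a concurrent item pool; not the real class.
struct ItemPool {
    std::size_t d_itemSize;
    explicit ItemPool(std::size_t itemSize) : d_itemSize(itemSize) {}
};

// Before: the top-level manager owned one pool and handed a pointer to every
// channel it created, so all channels shared one lock-protected pool.
struct ChannelShared {
    ItemPool* d_itemPool_p;  // held, not owned
    explicit ChannelShared(ItemPool* pool) : d_itemPool_p(pool) {}
};

// After: each channel constructs and owns its own pool, so pool accesses
// from different channels never contend with each other.
struct ChannelIndependent {
    ItemPool d_itemPool;  // owned, independent per channel
    ChannelIndependent() : d_itemPool(64) {}
};
```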
Profiler
Before:
After:
Isolated benchmark
A simple benchmark program shows how large this effect can be.
By binding all threads to the same pool or to separate ones (comment/uncomment in the snippet), we can measure the total time to perform the same number of concurrent allocations and deallocations with a given number of threads.
The most important factor here is the frequency of concurrent pool calls: the more frequent they are, the more likely threads are to collide on the concurrency mechanism. To emulate this, the code has an inner "work" loop between pool calls. If this inner loop runs for 100 microseconds or more, the difference between one shared pool and independent ones is not visible. However, at 1 microsecond and below, the difference becomes huge.
With a 0.5-microsecond "work" loop, the execution times are:
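The benchmark described above can be sketched roughly as follows. This is a self-contained approximation using a mutex-guarded free list in place of the real concurrent pool; all names are hypothetical, and the "work" duration is emulated with a spin count rather than a wall-clock target:

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Minimal concurrent item pool: a mutex-guarded free list.
class ItemPool {
    std::mutex                           d_mutex;
    std::vector<void*>                   d_free;
    std::vector<std::unique_ptr<char[]>> d_blocks;
    std::size_t                          d_itemSize;

  public:
    explicit ItemPool(std::size_t itemSize) : d_itemSize(itemSize) {}

    void* allocate() {
        std::lock_guard<std::mutex> guard(d_mutex);
        if (d_free.empty()) {
            d_blocks.emplace_back(new char[d_itemSize]);
            return d_blocks.back().get();
        }
        void* item = d_free.back();
        d_free.pop_back();
        return item;
    }

    void deallocate(void* item) {
        std::lock_guard<std::mutex> guard(d_mutex);
        d_free.push_back(item);
    }
};

// Run `numThreads` workers, each doing `iters` allocate/work/deallocate
// cycles. `shared == true` binds all threads to one pool; `false` gives each
// thread its own pool, as this PR does per channel. Returns elapsed seconds.
double runBench(bool shared, int numThreads, int iters, int workSpins)
{
    std::vector<std::unique_ptr<ItemPool>> pools;
    const int numPools = shared ? 1 : numThreads;
    for (int i = 0; i < numPools; ++i) {
        pools.emplace_back(new ItemPool(64));
    }

    std::atomic<long> sink{0};  // prevents the "work" loop from being elided
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        ItemPool* pool = pools[shared ? 0 : t].get();
        threads.emplace_back([&sink, pool, iters, workSpins] {
            for (int i = 0; i < iters; ++i) {
                void* item = pool->allocate();
                long acc = 0;  // inner "work" loop between pool calls
                for (int w = 0; w < workSpins; ++w) {
                    acc += w;
                }
                sink.fetch_add(acc, std::memory_order_relaxed);
                pool->deallocate(item);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start).count();
}
```

Shrinking `workSpins` raises the pool-call frequency, which is exactly the regime where the shared pool's lock contention dominates.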
Counting allocator stats table
Before:
The new counting allocator stats table looks like this: